New  Theoretical  Frameworks  for  Machine  Learning 


Maria-Florina  Balcan 

CMU-CS-08-153 
September  15th,  2008 


School  of  Computer  Science 
Carnegie  Mellon  University 
Pittsburgh,  PA  15213 


Thesis  Committee: 

Avrim  Blum,  Chair 
Manuel  Blum 
Yishay  Mansour 
Tom  Mitchell 
Santosh  Vempala 


Submitted  in  partial  fulfillment  of  the  requirements 
for  the  degree  of  Doctor  of  Philosophy. 


Copyright  ©  2008  Maria-Florina  Balcan 

This research was sponsored by the National Science Foundation under grant numbers IIS-0121678, IIS-0312814, CCF-0514922,
CCR-0122581, the U.S. Army Research Office under grant number DAAD-190213089, Google, and the IBM Ph.D. Fellowship.
The views and conclusions contained in this document are those of the author and should not be interpreted as representing the
official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.



Keywords: Data Dependent Concept Spaces, Clustering, Value of Unlabeled Data, Semi-supervised
Learning, Active Learning, Co-training, Similarity-based Learning, Kernels, Margins, Low-Dimensional
Mappings, Sample Complexity, Mechanism and Auction Design, Random Sampling Mechanisms, Profit
Maximization.


In memory of my father. You will remain forever in my soul, heart, and mind!




Abstract 


This  thesis  has  two  primary  thrusts.  The  first  is  developing  new  models  and  algorithms  for  important 
modern  and  classic  learning  problems.  The  second  is  establishing  new  connections  between  Machine 
Learning  and  Algorithmic  Game  Theory. 

The formulation of the PAC learning model by Valiant [201] and the Statistical Learning Theory
framework by Vapnik [203] have been instrumental in the development of machine learning and the design
and analysis of algorithms for supervised learning. However, while extremely influential, these models do
not capture or explain other important classic learning paradigms such as Clustering, nor do they capture
important emerging learning paradigms such as Semi-Supervised Learning and other ways of incorporating
unlabeled data in the learning process. In this thesis, we develop the first analog of these general
discriminative models to the problems of Semi-Supervised Learning and Clustering, and we analyze both
their algorithmic and sample complexity implications. We also provide the first generalization of the
well-established theory of learning with kernel functions to the case of general pairwise similarity functions and
in addition provide new positive theoretical results for Active Learning. Finally, this dissertation presents
new applications of techniques from Machine Learning to Algorithmic Game Theory, which has been a
major area of research at the intersection of Computer Science and Economics.

In  machine  learning,  there  has  been  growing  interest  in  using  unlabeled  data  together  with  labeled 
data  due  to  the  availability  of  large  amounts  of  unlabeled  data  in  many  contemporary  applications.  As  a 
result,  a  number  of  different  semi-supervised  learning  methods  such  as  Co-training,  transductive  SVM, 
or  graph  based  methods  have  been  developed.  However,  the  underlying  assumptions  of  these  methods 
are  often  quite  distinct  and  not  captured  by  standard  theoretical  models.  This  thesis  introduces  a  new 
discriminative  model  (a  PAC  or  Statistical  Learning  Theory  style  model)  for  semi-supervised  learning, 
that  can  be  used  to  reason  about  many  of  the  different  approaches  taken  over  the  past  decade  in  the 
Machine  Learning  community.  This  model  provides  a  unified  framework  for  analyzing  when  and  why 
unlabeled data can help in the semi-supervised learning setting, in which one can analyze both
sample-complexity and algorithmic issues. In particular, our model allows us to address in a unified way key
issues such as “Under what conditions will unlabeled data help and by how much?” and “How much data
should I expect to need in order to perform well?”.

Another important part of this thesis is Active Learning for which we provide several new theoretical
results. In particular, this dissertation includes the first active learning algorithm which works in the
presence of arbitrary forms of noise, as well as a few margin based active learning algorithms.

In the context of Kernel methods (another flourishing area of machine learning research), this thesis
shows how Random Projection techniques can be used to convert a given kernel function into an explicit,
distribution dependent set of features, which can then be fed into more general (not necessarily
kernelizable) learning algorithms. In addition, this work shows how such methods can be extended to more
general pairwise similarity functions and also gives a formal theory that matches the standard intuition
that a good kernel function is one that acts as a good measure of similarity. We thus strictly generalize and
simplify the existing theory of kernel methods. Our approach brings a new perspective as well as a much
simpler explanation for the effectiveness of kernel methods, which can help in the design of good kernel
functions for new learning problems.

We also show how we can use this perspective to help think about Clustering in a novel way. While
the study of clustering is centered around an intuitively compelling goal (and it has been a major tool
in many different fields), reasoning about it in a generic and unified way has been difficult, in part due
to the lack of a general theoretical framework along the lines we have for supervised classification. In
our work we develop the first general discriminative clustering framework for analyzing accuracy without
probabilistic assumptions.

This dissertation also contributes new connections between Machine Learning and Mechanism
Design. Specifically, this thesis presents the first general framework in which machine learning methods
can be used for reducing mechanism design problems to standard algorithmic questions for a wide range
of revenue maximization problems in an unlimited supply setting. Our results substantially generalize
the previous work based on random sampling mechanisms, both by broadening the applicability of such
mechanisms and by simplifying the analysis. From a learning perspective, these settings present several
unique challenges: the loss function is discontinuous and asymmetric, and the range of bidders’ valuations
may be large.




Acknowledgments 


CMU is certainly the best place and Avrim Blum is certainly the best advisor for a learning theory thesis.
Under his guidance I was able to think at a fundamental level about a variety of types of machine learning
questions: both classic and modern, both technical and conceptual. In addition, he shared with me his
incredibly sharp insights and great expertise in other areas, including but not limited to game theory and
algorithms. I can definitely say we have worked together on the most interesting problems one could
imagine.¹ Among many other things, he taught me how in a research problem it is most crucial, above all,
to single out the key questions and then to solve them most elegantly.

During my Ph.D. years I also had the privilege to learn, observe, and “steal” many tricks of the trade
from other awesome researchers. In particular, I would like to name my thesis committee members Manuel
Blum, Yishay Mansour, Tom Mitchell, and Santosh Vempala, as well as two other valuable collaborators:
Michael Kearns and Tong Zhang. Yishay has always been a great source of technical insight in a variety
of areas ranging from pure machine learning topics to algorithms or game theory. Santosh has also lent
me his insights in seemingly impossible technical questions. Tom and Manuel, besides being the most
charismatic faculty members at CMU, have also been great models for shaping up my own perspective on
computer science research. Michael and Tong have also been particularly great to interact with.

I would like to thank all my other collaborators and co-authors, inside and outside CMU, for making
both the low-level research (paper writing, slide preparing, conference calling, coffee drinking) and
high-level research (idea sharing, reference pointing, and again coffee drinking) a lot of fun. Particular gratitude
goes to Alina Beygelzimer, John Langford, Ke Yang, Adam Kalai, Anupam Gupta, Jason Hartline, and
Nathan Srebro.

I am grateful to George Necula for laying out the Ph.D. map for me, for advertising CMU as one of the
best places on Earth, and for all his altruistic advice in the past seven years. Andrew Gilpin has been my
only office mate (ever!) who has learnt key phrases in a foreign language especially so that he can interact
with me better (and well, to visit Romania as well). Much encouragement or feedback on various things
(including various drafts² and talks) I have gotten from Steve Hanneke, Ke Yang, and Stanislav Funiak.

My undergraduate days at the University of Bucharest would have simply not been the same without
Professors Luminita State, Florentina Hristea, and Ion Vaduva. They have all influenced me with their
style, taste, and charisma.

I am indebted to all my friends (older and newer) and my family for their support throughout the years.
Particularly, I thank my brother Marius both for his brilliant idea to nickname me Nina (when he was too
young to appreciate longer names) and for coping with having a studious sister.³ Most of what I am today
is due to my parents Ioana and Dumitru who, without even planning to make a scientist out of me, offered
me an extraordinary model on how to be serious and thorough about what I should choose to do in my life.

And of course, I thank Doru for always being “one of a kind”.

¹ Though sometimes it took us a while to formulate and agree on them.

² All the remaining typos in this thesis are entirely my responsibility.

³ Finishing this thesis on September 15th is my birthday present for him.



Contents 


Abstract

Acknowledgments

1 Introduction
    1.1 Overview
        1.1.1 Incorporating Unlabeled Data in the Learning Process
        1.1.2 Similarity Based Learning
        1.1.3 Clustering via Similarity Functions
        1.1.4 Mechanism Design, Machine Learning, and Pricing Problems
    1.2 Summary of the Main Results and Bibliographic Information

2 A Discriminative Framework for Semi-Supervised Learning
    2.1 Introduction
        2.1.1 Our Contribution
        2.1.2 Summary of Main Results
        2.1.3 Structure of this Chapter
    2.2 A Formal Framework
    2.3 Sample Complexity Results
        2.3.1 Uniform Convergence Bounds
        2.3.2 ε-Cover-based Bounds
    2.4 Algorithmic Results
        2.4.1 A simple case
        2.4.2 Co-training with linear separators
    2.5 Related Models
        2.5.1 A Transductive Analog of our Model
        2.5.2 Connections to Generative Models
        2.5.3 Connections to the Luckiness Framework
        2.5.4 Relationship to Other Ways of Using Unlabeled Data for Learning
    2.6 Conclusions
        2.6.1 Subsequent Work
        2.6.2 Discussion

3 A General Theory of Learning with Similarity Functions
    3.1 Learning with Kernel Functions. Introduction
    3.2 Background and Notation
    3.3 Learning with More General Similarity Functions: A First Attempt
        3.3.1 Sufficient Conditions for Learning with Similarity Functions
        3.3.2 Simple Sufficient Conditions
        3.3.3 Main Balcan - Blum’06 Conditions
        3.3.4 Extensions
        3.3.5 Relationship Between Good Kernels and Good Similarity Measures
    3.4 Learning with More General Similarity Functions: A Better Definition
        3.4.1 New Notions of Good Similarity Functions
        3.4.2 Good Similarity Functions Allow Learning
        3.4.3 Separation Results
        3.4.4 Relation Between Good Kernels and Good Similarity Functions
        3.4.5 Tightness
        3.4.6 Learning with Multiple Similarity Functions
    3.5 Connection to the Semi-Supervised Learning Setting
    3.6 Conclusions

4 A Discriminative Framework for Clustering via Similarity Functions
    4.1 Introduction
        4.1.1 Perspective
        4.1.2 Our Results
        4.1.3 Connections to other chapters and to other related work
    4.2 Definitions and Preliminaries
    4.3 Simple Properties
    4.4 Weaker properties
    4.5 Stability-based Properties
    4.6 Inductive Setting
    4.7 Approximation Assumptions
        4.7.1 The ν-strict separation Property
    4.8 Other Aspects and Examples
        4.8.1 Computational Hardness Results
        4.8.2 Other interesting properties
        4.8.3 Verification
        4.8.4 Examples
    4.9 Conclusions and Discussion
    4.10 Other Proofs

5 Active Learning
    5.1 Agnostic Active Learning
        5.1.1 Introduction
        5.1.2 Preliminaries
        5.1.3 The Agnostic Active Learner
        5.1.4 Active Learning Speedups
        5.1.5 Subsequent Work
        5.1.6 Conclusions
    5.2 Margin Based Active Learning
        5.2.1 The Realizable Case under the Uniform Distribution
        5.2.2 The Non-realizable Case under the Uniform Distribution
        5.2.3 Discussion
    5.3 Other Results in Active Learning

6 Kernels, Margins, and Random Projections
    6.1 Introduction
    6.2 Notation and Definitions
    6.3 Two simple mappings
    6.4 An improved mapping
        6.4.1 A few extensions
    6.5 On the necessity of access to D
    6.6 Conclusions and Discussion

7 Mechanism Design, Machine Learning, and Pricing Problems
    7.1 Introduction, Problem Formulation
    7.2 Model, Notation, and Definitions
        7.2.1 Abstract Model
        7.2.2 Offers, Preferences, and Incentives
        7.2.3 Quasi-linear Preferences
        7.2.4 Examples
    7.3 Generic Reductions
        7.3.1 Generic Analyses
        7.3.2 Structural Risk Minimization
        7.3.3 Improving the Bounds
    7.4 The Digital Good Auction
        7.4.1 Data Dependent Bounds
        7.4.2 A Special Purpose Analysis for the Digital Good Auction
    7.5 Attribute Auctions
        7.5.1 Market Pricing
        7.5.2 General Pricing Functions over the Attribute Space
        7.5.3 Algorithms for Optimal Pricing Functions
    7.6 Combinatorial Auctions
        7.6.1 Bounds via Discretization
        7.6.2 Bounds via Counting
        7.6.3 Combinatorial Auctions: Lower Bounds
        7.6.4 Algorithms for Item-pricing
    7.7 Conclusions and Discussion

8 Bibliography

A Additional Proof and Known Results
    A.1 Appendix for Chapter 2
        A.1.1 Standard Results
        A.1.2 Additional Proofs
    A.2 Appendix for Chapter 5
        A.2.1 Probability estimation in high dimensional ball
    A.3 Appendix for Chapter 7
        A.3.1 Concentration Inequalities



Chapter  1 

Introduction 


The formulation of the classic discriminative models for Supervised Learning, namely the PAC learning
model by Valiant [201] and the Statistical Learning Theory framework by Vapnik [203], was instrumental
in the development of machine learning and the design and analysis of algorithms for supervised learning.
However,  while  very  influential,  these  models  do  not  capture  or  explain  other  important  classic  learning 
paradigms  such  as  Clustering,  nor  do  they  capture  important  emerging  learning  paradigms  such  as  Semi- 
Supervised  Learning  and  other  ways  of  incorporating  unlabeled  data  in  the  learning  process.  In  this 
thesis,  we  develop  new  frameworks  and  algorithms  for  addressing  key  issues  in  several  important  classic 
and  modern  learning  paradigms.  In  particular,  we  study  Semi-Supervised  Learning,  Active  Learning, 
Learning  with  Kernels  and  more  general  similarity  functions,  as  well  as  Clustering.  In  addition,  we 
present  new  applications  of  techniques  from  Machine  Learning  to  emerging  areas  of  Computer  Science, 
such  as  Auction  and  Mechanism  Design. 

We  start  with  a  high  level  presentation  of  our  work,  and  then  in  Section  1.1  we  give  a  more  detailed 
overview of the main contributions of this thesis in each of the main directions. In Section 1.2 we summarize
the main results and describe the structure of this thesis, as well as provide bibliographic information.

New Frameworks and Algorithms for Machine Learning. Over the years, machine learning has grown
into a broad discipline that has produced fundamental theories of learning processes, as well as learning
algorithms that are routinely used in commercial systems for speech recognition, computer vision, and
spam detection, to name just a few. The primary theoretical advances have been for passive supervised
learning problems [172], where a target function (e.g., a classifier) is estimated using only labeled examples
which are considered to be drawn i.i.d. from the whole population. For example, in spam detection
an automatic classifier to label emails as spam or not would be trained using a sample of previous emails
labeled by a human user. However, for most contemporary practical problems there is often useful additional
information available in the form of cheap and plentiful unlabeled data: e.g., unlabeled emails for
the spam detection problem. As a consequence, there has recently been substantial interest in Semi-Supervised
Learning, a method for using unlabeled data together with labeled data to improve learning.
Several different semi-supervised learning algorithms have been developed and numerous successful experimental
results have been reported. However, the underlying assumptions of these methods are quite
different and their effectiveness cannot be explained by standard learning models (the PAC model or the
Statistical Learning Theory framework). While many of these methods had theoretical justification under
specific assumptions, there has been no unified framework for semi-supervised learning in general. In this
thesis, we develop a comprehensive theoretical framework that provides a unified way for thinking about
semi-supervised learning; this model can be used to reason about many of the different approaches taken
over the past decade in the machine learning community.¹

In the context of Active Learning (another modern learning paradigm in which the algorithm can
interactively ask for the labels of unlabeled examples of its own choosing), we present several new theoretical
results. In particular, we describe the first active learning procedure that works in the presence
of arbitrary forms of noise. This procedure relies only upon the assumption that samples are drawn i.i.d.
from some underlying distribution and it makes no assumptions about the mechanism producing the noise
(e.g., class/target misfit, fundamental randomization, etc.). We also present theoretical justification for
margin-based algorithms which have proven quite successful in practical applications, e.g., in text classification
[199].

Another  important  component  of  this  thesis  is  the  development  of  more  intuitive  and  more  operational 
explanations  for  well-established  learning  paradigms,  for  which  a  solid  theory  did  exist,  but  it  was  too 
abstract  and  disconnected  from  practice.  In  particular,  in  the  context  of  Kernel  methods  (a  state  of  the 
art technique for supervised learning and a flourishing area of research in modern machine learning), we
develop  a  theory  of  learning  with  similarity  functions  that  provides  theoretical  justification  for  the  common 
intuition  that  a  good  kernel  function  is  one  that  acts  as  a  good  measure  of  similarity.  This  theory  is  strictly 
more  general  and  involves  more  tangible  quantities  than  those  used  by  the  traditional  analysis. 

Finally,  we  also  present  a  new  perspective  on  the  classic  Clustering  problem.  Problems  of  clustering 
data from pairwise similarity information are ubiquitous in science and as a consequence clustering has
received substantial attention in many different fields for many years. The theoretical work on the topic has
generally been of two types: either on algorithms for (approximately) optimizing various distance-based
objectives such as k-median, k-means, and min-sum, or on clustering under probabilistic “generative
model” assumptions such as mixtures of Gaussians or related distributions. In this thesis we propose a
new approach to analyzing the problem of clustering. We consider the goal of approximately recovering
an unknown target clustering using a similarity function (or a weighted graph), given only the assumption
of certain natural properties that the similarity or weight function satisfies with respect to the desired
clustering.  Building  on  our  models  for  learning  with  similarity  functions  in  the  context  of  supervised 
classification,  we  provide  the  first  general  discriminative  clustering  framework  for  analyzing  clustering 
accuracy  without  probabilistic  assumptions.  In  this  model  we  directly  address  the  fundamental  question 
of  what  kind  of  information  a  clustering  algorithm  needs  in  order  to  produce  a  highly  accurate  clustering 
of  the  data,  and  we  analyze  both  information  theoretic  and  algorithmic  aspects. 

At  a  technical  level,  a  common  characteristic  of  many  of  the  models  we  introduce  to  study  these  learning 
paradigms (e.g., semi-supervised learning or learning and clustering via similarity functions) is the use
of  data  dependent  concept  spaces,  which  we  expect  to  be  a  major  line  of  research  in  the  next  years  in 
machine  learning.  The  variety  of  results  we  present  in  these  models  relies  on  a  very  diverse  set  of  insights 
and  techniques  from  Algorithms  and  Complexity,  Empirical  Processes  and  Statistics,  Optimization,  as 
well  as  Geometry  and  Embeddings. 

Connections between Machine Learning and Algorithmic Game Theory. This thesis also includes a
novel  application  of  machine  learning  techniques  to  automate  aspects  of  Mechanism  Design  and  formally 
address  the  problem  of  market  analysis,  as  well  as  development  of  pricing  algorithms  with  improved 
guarantees  over  previous  methods. 

Developing algorithms for a highly distributed medium such as the Internet requires a careful consideration
of the objectives of the various parties in the system. As a consequence, Mechanism Design has
become an increasingly important part of algorithmic research and computer science more generally in
recent years. Mechanism design can be thought of as a distinct form of algorithm design, where a central
entity must perform some computation (e.g., resource allocation or decision making) under the constraint
that the agents supplying the inputs have their own interest in the outcome of the computation. As a result,
it is desirable that the employed procedure be incentive compatible, meaning that it should be in each
agent’s best interest to report truthfully, or to otherwise act in a well-behaved manner. Typical examples
of such mechanisms are auctions of products (e.g., software packages) or pricing of shared resources (e.g.,
network links) where the central entity would use inputs (bids) from the agents in order to allocate goods in
a way that maximizes its revenue. Most of the previous work on incentive compatible mechanism design
for revenue maximization has been focused on very restricted settings [122, 174] (e.g., one item for sale
and/or single parameter agents), and many of the previous incentive compatible mechanisms have been
“hand-crafted” for the specific problem at hand. In this thesis we use techniques from machine learning
to provide a generic reduction from the incentive-compatible mechanism design question to more standard
algorithmic questions, for a wide variety of revenue-maximization problems, in an unlimited supply
setting.

¹ This model appears in a recent book about Semi-Supervised Learning [27] and it can be used to explain when and why
unlabeled data can help in many of the specific methods given in the other chapters of the book.
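
One way to picture a reduction from incentive-compatible mechanism design to a standard algorithmic question is the random sampling idea for a single digital good in unlimited supply: split the bidders at random into two groups, solve the purely algorithmic problem of finding the best fixed price on each group, and offer each group the price computed from the other group. The Python sketch below is only an illustration under simplifying assumptions (one item, one posted price per group, hypothetical function names); it is not the general mechanism analyzed in this thesis.

```python
import random


def best_fixed_price(bids):
    """Algorithmic step: the price p maximizing p * (number of bids >= p)."""
    best_price, best_revenue = 0.0, 0.0
    for p in sorted(set(bids)):
        revenue = p * sum(1 for b in bids if b >= p)
        if revenue > best_revenue:
            best_price, best_revenue = p, revenue
    return best_price


def random_sampling_auction(bids, rng):
    """Offer each bidder the best fixed price computed on the *other* group.

    The partition is drawn independently of the bids, so the price a
    bidder faces never depends on that bidder's own report.
    """
    in_first = [rng.random() < 0.5 for _ in bids]
    group1 = [b for b, first in zip(bids, in_first) if first]
    group2 = [b for b, first in zip(bids, in_first) if not first]
    price_from_1 = best_fixed_price(group1)   # offered to group 2
    price_from_2 = best_fixed_price(group2)   # offered to group 1
    offers = [price_from_2 if first else price_from_1 for first in in_first]
    # A bidder buys exactly when the bid is at least the offered price.
    revenue = sum(offer for b, offer in zip(bids, offers) if b >= offer)
    return offers, revenue
```

Because a bidder's offered price is computed only from the other group's bids, misreporting cannot lower the price she faces, which is the sense in which this sketch is incentive compatible for a fixed partition.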


1.1  Overview 

A  more  detailed  overview  of  this  thesis  follows  below. 

1.1.1  Incorporating  Unlabeled  Data  in  the  Learning  Process 

As  mentioned  earlier,  machine  learning  has  traditionally  focused  on  problems  of  learning  a  task  from 
labeled  examples  only.  However,  for  many  contemporary  practical  problems  such  as  classifying  web 
pages  or  detecting  spam,  there  is  often  additional  information  available;  in  particular,  for  many  of  these 
settings  unlabeled  data  is  often  much  cheaper  and  more  plentiful  than  labeled  data.  As  a  consequence, 
there  has  recently  been  substantial  interest  in  using  unlabeled  data  together  with  labeled  data  for  learning 
[59, 62, 135, 141, 159, 176, 181, 215], since clearly, if useful information can be extracted from it that
reduces dependence on labeled examples, this can be a significant benefit [58, 172].

There  are  currently  several  settings  that  have  been  considered  for  incorporating  unlabeled  data  in  the 
learning  process.  Here,  in  addition  to  a  set  of  labeled  examples  drawn  at  random  from  the  underlying  data 
distribution,  it  is  assumed  that  the  learning  algorithm  can  also  use  a  (usually  much  larger)  set  of  unlabeled 
examples  from  the  same  distribution. 

A first such setting is passive Semi-Supervised Learning (which we will refer to as SSL) [6]. What
makes unlabeled data so useful in the SSL context and what many of the SSL methods exploit, is that for
a wide variety of learning problems, the natural regularities of the problem involve not only the form of
the function being learned but also how this function relates to the distribution of data. For example, in
many problems one might expect the target function should cut through low density regions of the space,
a property used by the transductive SVM algorithm [141]. In other problems one might expect the target
to be self-consistent in some way, a property used by Co-training [62]. Unlabeled data is then potentially
useful in this setting because, in principle, it allows one to reduce the search space from the whole set of
hypotheses, down to the set of a-priori reasonable ones with respect to the underlying distribution.
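
As a rough illustration of how unlabeled data can shrink the search space, the following Python sketch uses threshold classifiers on the line and treats a threshold as a-priori reasonable when no unlabeled point falls within a margin of it (a simple stand-in for "cutting through a low density region"). The two-cluster data, the candidate grid, the margin value, and the function names are all illustrative assumptions rather than constructions from this thesis.

```python
import random


def compatible_thresholds(unlabeled, candidates, margin):
    """Keep thresholds with no unlabeled point within `margin` of them."""
    return [t for t in candidates
            if all(abs(x - t) >= margin for x in unlabeled)]


def consistent_thresholds(thresholds, labeled):
    """Keep thresholds t whose rule 'positive iff x >= t' fits every labeled pair."""
    return [t for t in thresholds
            if all((x >= t) == (y == 1) for x, y in labeled)]


def demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical unlabeled sample: two clusters separated by an empty gap.
    unlabeled = ([rng.uniform(0.0, 0.3) for _ in range(200)] +
                 [rng.uniform(0.7, 1.0) for _ in range(200)])
    candidates = [i / 100 for i in range(101)]   # the whole hypothesis set
    labeled = [(0.1, 0), (0.9, 1)]               # only two labeled examples

    # Search space when using the labeled examples alone.
    labeled_only = consistent_thresholds(candidates, labeled)
    # Search space after first pruning with the unlabeled data.
    reduced = compatible_thresholds(unlabeled, candidates, margin=0.1)
    with_unlabeled = consistent_thresholds(reduced, labeled)
    return labeled_only, with_unlabeled
```

In this toy setting the two labeled examples alone leave every threshold between the two labeled points in play, while the unlabeled data narrows the candidates to those lying in the gap between the clusters.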

A second setting that has been considered for incorporating unlabeled data in the learning process,
and that has become increasingly popular over the past few years, is Active Learning [86, 94]. Here, the learning
algorithm  has  both  the  capability  of  drawing  random  unlabeled  examples  from  the  underlying  distribution, 
and  that  of  asking  for  the  labels  of  any  of  these  examples.  The  hope  is  that  a  good  classifier  can  be  learned 
with significantly fewer labels by actively directing the queries to informative examples. As opposed to
the SSL setting, and similarly to the classical supervised learning settings (PAC and Statistical Learning
Theory settings), the only prior belief about the learning problem in the active learning setting is that
the target function (or a good approximation of it) belongs to a given concept class. Luckily, it turns
out that for simple concept classes such as linear separators on the line one can achieve an exponential
improvement (over the usual supervised learning setting) in the labeled data sample complexity, under no
additional assumptions about the learning problem [86, 94]. In general, however, for more complicated
concept classes, the speed-ups achievable in the active learning setting depend on the match between the
distribution over example-label pairs and the hypothesis class. Furthermore, there are simple examples
where active learning does not help at all, not even in the realizable case [94].
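
The exponential improvement for linear separators on the line can be illustrated with a binary search over a pool of unlabeled points. The following is only a minimal sketch under stated assumptions, not an algorithm from this thesis: it assumes the realizable case (a point is positive exactly when it lies at or above an unknown threshold), and the names `active_learn_threshold` and `query_label` are hypothetical.

```python
import random

def active_learn_threshold(pool, query_label):
    """Binary search for the leftmost positive point in a pool of unlabeled points.

    Assumes the realizable case: label(x) = 1 iff x >= some unknown threshold.
    Returns (estimated_threshold, number_of_label_queries).
    """
    pts = sorted(pool)
    lo, hi = 0, len(pts)      # invariant: pts[:lo] are negative, pts[hi:] are positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(pts[mid]) == 1:
            hi = mid          # leftmost positive is at mid or to its left
        else:
            lo = mid + 1      # leftmost positive is strictly to the right
    # The smallest positive pool point is consistent with every label in the pool.
    estimate = pts[lo] if lo < len(pts) else float("inf")
    return estimate, queries

# Hypothetical demo: 1024 unlabeled points, labels revealed only on request.
random.seed(0)
true_threshold = 0.37
pool = [random.random() for _ in range(1024)]
estimate, queries = active_learn_threshold(pool, lambda x: int(x >= true_threshold))
```

In this sketch the returned threshold labels all 1024 pool points correctly after at most 11 label queries, whereas a passive learner would have to see the labels of points drawn at random.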

In  this  thesis  we  study  both  Active  Learning  and  Semi-Supervised  Learning.  For  the  semi-supervised 
learning  problem,  we  provide  a  unified  discriminative  model  (i.e.,  a  PAC  or  Statistical  Learning  Theory 
style  model)  that  captures  many  of  the  ways  unlabeled  data  is  typically  used,  and  provides  a  very  general 
framework  for  thinking  about  this  issue.  This  model  provides  a  unified  framework  for  analyzing  when  and 
why  unlabeled  data  can  help,  in  which  one  can  discuss  both  sample-complexity  and  algorithmic  issues. 
Our  model  can  be  viewed  as  an  extension  of  the  standard  PAC  model,  where  in  addition  to  a  concept  class 
C,  one  also  proposes  a  compatibility  function  (an  abstract  prior):  a  type  of  compatibility  that  one  believes 
the  target  concept  should  have  with  the  underlying  distribution  of  data.  For  example,  such  a  belief  could 
be  that  the  target  should  cut  through  a  low-density  region  of  space,  or  that  it  should  be  self-consistent 
in  some  way  as  in  co-training.  This  belief  is  then  explicitly  represented  in  the  model.  Unlabeled  data 
is  then  potentially  helpful  in  this  setting  because  it  allows  one  to  estimate  compatibility  over  the  space 
of  hypotheses,  and  to  reduce  the  size  of  the  search  space  from  the  whole  set  of  hypotheses  C  down 
to  those  that,  according  to  one’s  assumptions,  are  a-priori  reasonable  with  respect  to  the  distribution. 
After  proposing  the  model,  we  analyze  fundamental  sample-complexity  issues  in  this  setting  such  as 
“How  much  of  each  type  of  data  one  should  expect  to  need  in  order  to  learn  well?”,  and  “What  are  the 
basic  quantities  that  these  numbers  depend  on?”.  We  present  a  variety  of  sample-complexity  bounds, 
both in terms of uniform-convergence results, which apply to any algorithm that is able to find rules
of low error and high compatibility, as well as $\epsilon$-cover-based bounds that apply to a more restricted
class of algorithms but can be substantially tighter. For instance, we describe several natural cases in
which $\epsilon$-cover-based bounds can apply even though with high probability there still exist bad hypotheses
in the class consistent with the labeled and unlabeled examples. Finally, we present several PAC-style
algorithmic results in this model. Our main algorithmic result is a new algorithm for Co-Training with
linear separators that, if the distribution satisfies independence given the label, requires only a single
labeled example to learn to any desired error rate $\epsilon$ and is computationally efficient (i.e., achieves PAC
guarantees). This substantially improves on the results of [62], which required enough labeled examples
to produce an initial weak hypothesis. We describe these results in Chapter 2.

For the active learning problem, we prove, for the first time, the feasibility of agnostic active learning.
Specifically, we propose and analyze the first active learning algorithm that finds an $\epsilon$-optimal hypothesis
in any hypothesis class, when the underlying distribution has arbitrary forms of noise. We also analyze
margin-based active learning of linear separators. We discuss these results in Chapter 5. Finally, we
mention recent work in which we have shown that in an asymptotic model for active learning where one
bounds the number of queries the algorithm makes before it finds a good function (i.e., one of arbitrarily
small error rate), but not the number of queries before it knows it has found a good function, one can
obtain significantly better bounds on the number of label queries required to learn than in the traditional
active learning models.

In addition to being helpful in the semi-supervised learning and active learning settings, unlabeled
data becomes useful in other settings as well, both in partially supervised learning models and, of course,
in purely unsupervised learning (e.g., clustering). In this thesis we study the use of unlabeled data in the
context  of  learning  with  Kernels  and  more  general  similarity  functions.  We  also  analyze  how  to  effectively 
use  unlabeled  data  for  Clustering  with  non-interactive  feedback.  We  discuss  these  in  turn  below. 

1.1.2  Similarity  Based  Learning 

Kernel  functions  have  become  an  extremely  popular  tool  in  machine  learning,  with  an  attractive  theory 
as well [133, 139, 187, 190, 203]. They are used in domains ranging from Computer Vision [132] to
Computational Biology [187] to Language and Text Processing [139], with workshops (e.g., [2, 3, 4, 5]),
books [133, 139, 187, 190, 203], and large portions of major conferences (see, e.g., [1]) devoted to kernel
methods.  In  this  thesis,  we  strictly  generalize  and  simplify  the  existing  theory  of  Kernel  Methods.  Our 
approach  brings  a  new  perspective  as  well  as  a  much  simpler  explanation  for  the  effectiveness  of  kernel 
methods,  which  can  help  in  the  design  of  good  kernel  functions  for  new  learning  problems. 

A kernel is a function that takes in two data objects (which could be images, DNA sequences, or points
in $\mathbb{R}^n$) and outputs a number, with the property that the function is symmetric and positive-semidefinite.
That is, for any kernel $K$, there must exist an (implicit) mapping $\phi$ such that for all inputs $x, x'$ we have
$K(x, x') = \phi(x) \cdot \phi(x')$. The kernel is then used inside a “kernelized” learning algorithm such as SVM
or kernel-perceptron as the way in which the algorithm interacts with the data. Typical kernel functions
for structured data include the polynomial kernel $K(x, x') = (1 + x \cdot x')^d$ and the Gaussian kernel
$K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$, and a number of special-purpose kernels have been developed for sequence
data, image data, and other types of data as well [88, 89, 157, 173, 193].
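
As a small illustration of the implicit mapping, the sketch below checks numerically that the degree-2 polynomial kernel on $\mathbb{R}^2$ equals a dot product under an explicit six-dimensional feature map. The function names (`polynomial_kernel`, `gaussian_kernel`, `phi_poly2`) and the default parameter values are our own choices for this example.

```python
import math

def dot(x, y):
    """Standard dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (1 + x . y)^d."""
    return (1.0 + dot(x, y)) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_poly2(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, y = (0.5, -1.0), (2.0, 0.25)
implicit = polynomial_kernel(x, y)            # computed without ever forming phi
explicit = dot(phi_poly2(x), phi_poly2(y))    # computed in the 6-dimensional space
```

Here `implicit` and `explicit` agree up to floating-point error. For the Gaussian kernel no finite-dimensional explicit map of this kind exists, which is why it is only ever evaluated implicitly.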

The  theory  behind  kernel  functions  is  based  on  the  fact  that  many  standard  algorithms  for  learning 
linear  separators,  such  as  SVMs  and  the  Perceptron  algorithm,  can  be  written  so  that  the  only  way  they 
interact  with  their  data  is  via  computing  dot-products  on  pairs  of  examples.  Thus,  by  replacing  each 
invocation of $x \cdot x'$ with a kernel computation $K(x, x')$, the algorithm behaves exactly as if we had
explicitly performed the mapping $\phi(x)$, even though $\phi$ may be a mapping into a very high-dimensional
space (dimension $n^d$ for the polynomial kernel) or even an infinite-dimensional space (as in the case of the
Gaussian kernel). Furthermore, these algorithms have convergence rates that depend only on the margin
of the best separator, and not on the dimension of the space in which the data resides [18, 191]. Thus,
kernel functions are often viewed as providing much of the power of this implicit high-dimensional space,
without paying for it computationally (because the $\phi$ mapping is only implicit) or in terms of sample size
(if  the  data  is  indeed  well-separated  in  that  space). 
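
To make the "only via dot-products" point concrete, here is a minimal sketch of a Perceptron written in dual form, so that it touches the data only through a kernel function. It is a textbook-style illustration rather than code from this thesis; the data set (an XOR-style labeling of four points) and the helper names are assumptions made for the example.

```python
def kernel_perceptron(examples, labels, kernel, epochs=10):
    """Dual Perceptron: interacts with the data only through kernel(x, x').

    Returns per-example mistake counts alpha; the hypothesis is
    sign(sum_i alpha[i] * labels[i] * kernel(examples[i], x)).
    """
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for j, (x, y) in enumerate(zip(examples, labels)):
            score = sum(a * yi * kernel(xi, x)
                        for a, xi, yi in zip(alpha, examples, labels) if a)
            if y * score <= 0:        # mistake (or zero score): update on example j
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:             # a full pass with no mistakes: converged
            break
    return alpha

def predict(alpha, examples, labels, kernel, x):
    score = sum(a * yi * kernel(xi, x)
                for a, xi, yi in zip(alpha, examples, labels) if a)
    return 1 if score > 0 else -1

def poly2(x, y):
    """Degree-2 polynomial kernel (1 + x . y)^2."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2

# XOR-style data: not linearly separable in the original two-dimensional space.
X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
Y = [1, 1, -1, -1]
alpha = kernel_perceptron(X, Y, poly2)
predictions = [predict(alpha, X, Y, poly2, x) for x in X]
```

With the polynomial kernel the algorithm separates these four points, even though the feature map is never computed explicitly.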

While  the  above  theory  is  quite  elegant,  it  has  a  few  limitations.  First,  when  designing  a  kernel  function 
for  some  learning  problem,  the  intuition  typically  employed  is  that  a  good  kernel  would  be  one  that  serves 
as a good similarity function for the given problem [187]. On the other hand, the above theory talks
about  margins  in  an  implicit  and  possibly  very  high-dimensional  space.  So,  in  this  sense  the  theory  is  not 
that  helpful  for  providing  intuition  when  selecting  or  designing  a  kernel  function.  Second,  it  may  be  that 
the  most  natural  similarity  function  for  a  given  problem  is  not  positive-semidefinite,  and  it  could  require 
substantial  work,  possibly  reducing  the  quality  of  the  function,  to  coerce  it  into  a  legal  form.  Finally,  from 
a  complexity-theoretic  perspective,  it  is  somewhat  unsatisfying  for  the  explanation  of  the  effectiveness  of 
some  algorithm  to  depend  on  properties  of  an  implicit  high-dimensional  mapping  that  one  may  not  even 
be  able  to  calculate.  In  particular,  the  standard  theory  at  first  blush  has  a  “something  for  nothing”  feel  to  it 
(all  the  power  of  the  implicit  high-dimensional  space  without  having  to  pay  for  it)  and  perhaps  there  is  a 
more  prosaic  explanation  of  what  it  is  that  makes  a  kernel  useful  for  a  given  learning  problem.  For  these 
reasons,  it  would  be  helpful  to  have  a  theory  that  involved  more  tangible  quantities. 

In  this  thesis  we  provide  new  theories  that  address  these  limitations  in  two  ways.  First,  we  show  how 
Random Projection techniques can be used to convert a given kernel function into an explicit, distribution-dependent
set of features, which can then be fed into more general (not necessarily kernelizable) learning
algorithms.  Conceptually,  this  result  suggests  that  designing  a  good  kernel  function  is  much  like  designing 
a  good  feature  space.  From  a  practical  perspective  it  provides  an  alternative  to  “kernelizing”  a  learning 
algorithm:  rather  than  modifying  the  algorithm  to  use  kernels,  one  can  instead  construct  a  mapping  into  a 
low-dimensional  space  using  the  kernel  and  the  data  distribution,  and  then  run  an  un-kernelized  algorithm 
over  examples  drawn  from  the  mapped  distribution. 
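
The idea of replacing a kernelized algorithm by an explicit mapping can be sketched as follows. We assume, purely for illustration, one simple mapping of this flavor: draw landmark points $z_1, \ldots, z_d$ from unlabeled data and map $x$ to $(K(x, z_1), \ldots, K(x, z_d))$. The precise constructions and their guarantees are the subject of the later chapters and may differ from this sketch; all function names here are hypothetical.

```python
def poly2(x, y):
    """Degree-2 polynomial kernel (1 + x . y)^2."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2

def make_feature_map(kernel, landmarks):
    """Explicit, data-dependent features: x -> (K(x, z_1), ..., K(x, z_d))."""
    def features(x):
        return [kernel(x, z) for z in landmarks]
    return features

def perceptron(points, labels, epochs=100):
    """Ordinary (un-kernelized) Perceptron with a bias term."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# Unlabeled pool used as landmarks (assumption: here it is just the four data points).
X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
Y = [1, 1, -1, -1]
features = make_feature_map(poly2, landmarks=X)
mapped = [features(x) for x in X]
w, b = perceptron(mapped, Y)
margins = [y * (sum(wi * fi for wi, fi in zip(w, f)) + b) for f, y in zip(mapped, Y)]
```

The learner in this sketch never calls the kernel itself; it only sees the mapped examples, which is the sense in which the kernel acts as a feature-space designer.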

Second, we also show how such methods can be extended to more general pairwise similarity functions
and also give a formal theory that matches the standard intuition that a good kernel function is one
that acts as a good measure of similarity. In particular, we define a notion of what it means for a pairwise
function $K(x, x')$ to be a “good similarity function” for a given learning problem that (a) does not require
the notion of an implicit space and allows for functions that are not positive semi-definite, (b) is provably
sufficient for learning, and (c) is broad, in the sense that a good kernel in the standard sense (large margin in
the implicit $\phi$-space) will also satisfy our definition of a good similarity function, though with some loss
in  the  parameters.  This  framework  provides  the  first  rigorous  explanation  for  why  a  kernel  function  that 
is  good  in  the  large-margin  sense  can  also  formally  be  viewed  as  a  good  measure  of  similarity,  thereby 
giving  formal  justification  to  a  common  intuition  about  kernels.  We  start  by  analyzing  a  first  notion  of  a 
good  similarity  function  in  Section  3.3  and  analyze  its  relationship  with  the  usual  notion  of  a  good  kernel 
function. We then present a slightly different and broader notion that we show provides an even better
kernel-to-similarity translation. Any large-margin kernel function is a good similarity function under
the  new  definition,  and  while  we  still  incur  some  loss  in  the  parameters,  this  loss  is  much  smaller  than 
under  the  prior  definition,  especially  in  terms  of  the  final  labeled  sample-complexity  bounds.  In  particular, 
when  using  a  valid  kernel  function  as  a  similarity  function,  a  substantial  portion  of  the  previous  sample- 
complexity  bound  can  be  transferred  over  to  merely  a  need  for  unlabeled  examples.  We  also  show  our 
new notion is strictly more general than the notion of a large margin kernel. We discuss these results in
Section 3.4. In Chapter 6 we present other random projection results for the case where $K$ is in fact a valid kernel.

1.1.3  Clustering  via  Similarity  Functions 

Problems of clustering data from pairwise similarity information are ubiquitous in science [8, 19, 83, 91,
95, 138, 146, 147, 151, 205]. A typical example task is to cluster a set of emails or documents according
to  some  criterion  (say,  by  topic)  by  making  use  of  a  pairwise  similarity  measure  among  data  objects.  In 
this  context,  a  natural  example  of  a  similarity  measure  for  document  clustering  might  be  to  consider  the 
fraction  of  important  words  that  two  documents  have  in  common. 

While  the  study  of  clustering  is  centered  around  an  intuitively  compelling  goal  (and  it  has  been  a  major 
tool  in  many  different  fields),  it  has  been  difficult  to  reason  about  it  at  a  general  level  in  part  due  to  the 
lack  of  a  theoretical  framework  along  the  lines  we  have  for  supervised  classification. 

In  this  thesis  we  develop  the  first  general  discriminative  framework  for  Clustering,  i.e.  a  framework  for 
analyzing  clustering  accuracy  without  making  strong  probabilistic  assumptions.  In  particular,  we  present 
a  theoretical  approach  to  the  clustering  problem  that  directly  addresses  the  fundamental  question  of  how 
good  the  similarity  measure  must  be  in  terms  of  its  relationship  to  the  desired  ground-truth  clustering  (e.g., 
clustering  by  topic)  in  order  to  allow  an  algorithm  to  cluster  well.  Very  strong  properties  and  assumptions 
are  needed  if  the  goal  is  to  produce  a  single  approximately-correct  clustering;  however,  we  show  that  if  we 
relax the objective and allow the algorithm to produce a hierarchical clustering such that the desired clustering
is  close  to  some  pruning  of  this  tree  (which  a  user  could  navigate),  then  we  can  develop  a  general  theory 
of  natural  properties  that  are  sufficient  for  clustering  via  various  kinds  of  algorithms.  This  framework  is 
an  analogue  of  the  PAC  learning  model  for  clustering,  where  the  natural  object  of  study,  rather  than  being 
a concept class, is instead a property of the similarity information with respect to the desired ground-truth
clustering.

Figure 1.1: Data lies in four regions A, B, C, D (e.g., think of these as documents on baseball, football, TCS,
and AI). Suppose that $K(x, y) = 1$ if $x$ and $y$ belong to the same region, $K(x, y) = 1/2$ if $x \in A$ and
$y \in B$ or if $x \in C$ and $y \in D$, and $K(x, y) = 0$ otherwise. Even assuming that all points are more similar
to other points in their own cluster than to any point in any other cluster, there are still multiple consistent
clusterings, including two consistent 3-clusterings ($(A \cup B, C, D)$ or $(A, B, C \cup D)$). However, there is
a single hierarchical decomposition such that any consistent clustering is a pruning of this tree.

As  indicated  above,  the  main  difficulty  that  appears  when  phrasing  the  problem  in  this  general  way  is 
that  if  one  defines  success  as  outputting  a  single  clustering  that  closely  approximates  the  correct  clustering, 
then  one  needs  to  assume  very  strong  conditions  on  the  similarity  function.  For  example,  if  the  function 
provided by the domain expert is extremely good, say $K(x, y) > 1/2$ for all pairs $x$ and $y$ that should be
in the same cluster, and $K(x, y) < 1/2$ for all pairs $x$ and $y$ that should be in different clusters, then we
could  just  use  it  to  recover  the  clusters  in  a  trivial  way.  However,  if  we  just  slightly  weaken  this  condition 
to  simply  require  that  all  points  x  are  more  similar  to  all  points  y  from  their  own  cluster  than  to  any  points 
y  from  any  other  clusters,  then  this  is  no  longer  sufficient  to  uniquely  identify  even  a  good  approximation 
to  the  correct  answer.  For  instance,  in  the  example  in  Figure  1.1,  there  are  multiple  clusterings  consistent 
with  this  property  (one  with  1  cluster,  one  with  2  clusters,  two  with  3  clusters,  and  one  with  4  clusters). 
Even  if  one  is  told  the  correct  clustering  has  3  clusters,  there  is  no  way  for  an  algorithm  to  tell  which  of 
the two (very different) possible solutions is correct. In fact, results of Kleinberg [151] can be viewed
as  effectively  ruling  out  a  broad  class  of  scale-invariant  properties  like  this  one  as  being  sufficient  for 
producing  the  correct  answer. 
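
The count of consistent clusterings for the example of Figure 1.1 can be checked by brute force. The sketch below is our own illustration: it treats each region as a single atom (splitting a region can never be consistent, since points inside one region have similarity 1 to each other), enumerates all partitions of {A, B, C, D}, and keeps those in which every point is strictly more similar to its own cluster than to any other cluster.

```python
from collections import Counter

REGIONS = ["A", "B", "C", "D"]
HALF_PAIRS = {frozenset("AB"), frozenset("CD")}

def K(r, s):
    """Similarity between a point of region r and a point of region s (Figure 1.1)."""
    if r == s:
        return 1.0
    return 0.5 if frozenset((r, s)) in HALF_PAIRS else 0.0

def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                       # `first` forms its own block
        for i in range(len(part)):                   # or joins an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def is_consistent(clustering):
    """Every point is strictly more similar to all points of its own cluster
    than to any point of any other cluster."""
    for cluster in clustering:
        for r in cluster:
            inside = min(K(r, s) for s in cluster)
            outside = [K(r, s) for other in clustering if other is not cluster for s in other]
            if outside and inside <= max(outside):
                return False
    return True

consistent = [p for p in partitions(REGIONS) if is_consistent(p)]
counts = Counter(len(p) for p in consistent)   # number of consistent clusterings per size
```

Under these assumptions, `counts` reproduces the tally stated above: one consistent clustering each with 1, 2, and 4 clusters, and two with 3 clusters.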

In  our  work  we  overcome  this  problem  by  considering  two  relaxations  of  the  clustering  objective 
that  are  natural  for  many  clustering  applications.  The  first  is  to  allow  the  algorithm  to  produce  a  small 
list of clusterings such that at least one of them has low error¹. The second is (as mentioned above) to
allow  the  clustering  algorithm  to  produce  a  tree  (a  hierarchical  clustering)  such  that  the  correct  answer  is 
approximately  some  pruning  of  this  tree.  For  instance,  the  example  in  Figure  1.1  has  a  natural  hierarchical 
decomposition  of  this  form.  Both  relaxed  objectives  make  sense  for  settings  in  which  we  imagine  the 
output  being  fed  to  a  user  who  will  then  decide  what  she  likes  best.  For  example,  with  the  tree  relaxation, 
we allow the clustering algorithm to effectively say: “I wasn’t sure how specific you wanted to be, so
if any of these clusters are too broad, just click and I will split it for you.” We then show that with
these  relaxations,  a  number  of  interesting,  natural  learning-theoretic  and  game-theoretic  properties  can  be 
defined  that  each  are  sufficient  to  allow  an  algorithm  to  cluster  well. 

¹ So, this is similar in spirit to list-decoding in coding theory.

For concreteness, we summarize our main results below. First, we consider a family
of stability-based properties, showing that a natural generalization of the “stable marriage” property is
sufficient to produce a hierarchical clustering. (The property is that no two subsets $A \subset C$, $A' \subset C'$ of
clusters $C \neq C'$ in the correct clustering are both more similar on average to each other than to the rest of
their  own  clusters.)  Moreover,  a  significantly  weaker  notion  of  stability  (which  we  call  “stability  of  large 
subsets”)  is  also  sufficient  to  produce  a  hierarchical  clustering,  but  requires  a  more  involved  algorithm.  We 
also  show  that  a  weaker  “average-attraction”  property  (which  is  provably  not  enough  to  produce  a  single 
correct  hierarchical  clustering)  is  sufficient  to  produce  a  small  list  of  clusterings,  and  give  generalizations 
to  even  weaker  conditions  that  are  related  to  the  notion  of  large-margin  kernel  functions.  We  develop 
a  notion  of  the  clustering  complexity  of  a  given  property  (the  minimum  possible  list  length  that  can  be 
guaranteed  by  any  algorithm)  and  provide  both  upper  and  lower  bounds  for  the  properties  we  consider. 
This notion is analogous to notions of capacity in classification [72, 103, 203] and it provides a formal
measure  of  the  inherent  usefulness  of  a  given  property.  We  show  that  properties  implicitly  assumed  by 
approximation  algorithms  for  standard  graph-based  objective  functions  can  be  viewed  as  special  cases  of 
some  of  the  properties  considered  above. 

We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized
sample, as in property testing. While most of our algorithms extend in a natural way, for certain
properties their analysis requires more involved arguments using regularity-type results of [14, 113].

More generally, our framework provides a formal way to analyze what properties of a similarity function
would be sufficient to produce low-error clusterings, as well as what algorithms are suited for a given
property.  For  some  of  our  properties  we  are  able  to  show  that  known  algorithms  succeed  (e.g.  variations  of 
bottom-up  hierarchical  linkage  based  algorithms).  However,  for  the  most  general  ones,  e.g.,  the  stability 
of large subsets property, we need new algorithms that are able to take advantage of them. In fact, the algorithm
we develop for the stability of large subsets property combines learning-theoretic approaches
used in Chapter 3 (and described in Section 1.1.2) with linkage-style methods. We describe these results
in Chapter 4.

1.1.4  Mechanism  Design,  Machine  Learning,  and  Pricing  Problems 

In this thesis we also present explicit connections between Machine Learning Theory and certain contemporary
problems in Economics.

With the Internet developing as the single most important arena for resource sharing among parties
with diverse and selfish interests, traditional algorithmic and distributed systems need to be combined with
the understanding of game-theoretic and economic issues [177]. A fundamental research endeavor in this
new field is the design and analysis of auction mechanisms and pricing algorithms [70, 121, 124, 129, 129].
In this thesis we show how machine learning methods can be used in the design of auctions and other
pricing mechanisms with guarantees on their performance.

In particular, we show how sample complexity techniques from statistical learning theory can be used
to reduce problems of incentive-compatible mechanism design to standard algorithmic questions, for a
wide range of revenue-maximizing problems in an unlimited supply setting. In doing so, we obtain a
unified approach for considering a variety of profit maximizing mechanism design problems, including
many that have been previously considered in the literature. We show how techniques from machine
learning theory can be used both for analyzing and designing our mechanisms. We apply our reductions
to a diverse set of revenue maximizing pricing problems, such as the problem of auctioning a digital good,
the attribute auction problem, and the problem of item pricing in unlimited supply combinatorial auctions.

For concreteness, in the following paragraphs, we shall give more details on the setting we study in
our work. Consider a seller with multiple digital goods or services for sale, such as movies, software,
or network services, over which buyers may have complicated preferences. In order to sell these items
through an incentive-compatible auction mechanism, this mechanism should have the property that each
bidder is offered a set of prices that do not depend on the value of her bid. The problem of designing
a revenue-maximizing auction is known in the economics literature as the optimal auction design problem
[171]. The classical model for optimal auction design assumes a Bayesian setting in which players’
valuations (types) are drawn from some probability distribution that furthermore is known to the mechanism
designer. For example, to sell a single item of fixed marginal cost, one should set the price that
maximizes  the  profit  margin  per  sale  times  the  probability  a  random  person  would  be  willing  to  buy  at 
that  price.  However,  in  complex  or  non-static  environments,  these  assumptions  become  unrealistic.  In 
these  settings,  machine  learning  can  provide  a  natural  approach  to  the  design  of  near-optimal  mechanisms 
without  such  strong  assumptions  or  degree  of  prior  knowledge. 
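
The Bayesian pricing rule mentioned above (maximize profit margin per sale times the probability of a sale) can be written out for a known discrete distribution of valuations. This is a minimal sketch with made-up numbers; the function name `optimal_posted_price` and the example distribution are assumptions for illustration only.

```python
def optimal_posted_price(valuations, probabilities, marginal_cost=0.0):
    """Price maximizing (price - marginal_cost) * Pr[valuation >= price].

    For a discrete distribution it suffices to try prices in its support,
    since raising a price up to the next support point never loses a buyer.
    """
    best_price, best_profit = None, float("-inf")
    for price in sorted(set(valuations)):
        prob_buy = sum(p for v, p in zip(valuations, probabilities) if v >= price)
        expected_profit = (price - marginal_cost) * prob_buy
        if expected_profit > best_profit:
            best_price, best_profit = price, expected_profit
    return best_price, best_profit

# Hypothetical known prior: a random buyer values the item at 1, 2, 3, or 4.
valuations = [1, 2, 3, 4]
probabilities = [0.4, 0.3, 0.2, 0.1]
price, profit = optimal_posted_price(valuations, probabilities)
```

With this prior the best price is 2, giving an expected profit of about 1.2 per buyer. The calculation relies entirely on knowing `probabilities`, which is exactly the assumption that becomes unrealistic in complex or non-static environments.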

Specifically, notice that while a truthful auction mechanism should have the property that the prices
offered to some bidder $i$ do not depend on the value of her bid, they can depend on the amounts bid by other
bidders $j$. From a machine learning perspective, this is very similar to thinking of bidders as “examples”
and our objective being to use information from examples $j \neq i$ to produce a good prediction with respect
to example $i$. Thus, without presuming a known distribution over bidders (or even that bidders come
from any distribution at all) perhaps if the number of bidders is sufficiently large, enough information
can be learned from some of them to perform well on the rest. In this thesis we formalize this idea and
show indeed that sample-complexity techniques from machine learning theory [18, 203] can be adapted
to this setting to give quantitative bounds for this kind of approach. More generally, we show that sample
complexity analysis can be applied to convert incentive-compatible mechanism design problems to more
standard algorithm-design questions, in a wide variety of revenue-maximizing auction settings.

Our reductions imply that for these problems, given an algorithm for the non incentive-compatible
pricing problem, we can convert it into an algorithm for the incentive-compatible mechanism design problem
that is only a factor of $(1 + \epsilon)$ worse, as long as the number of bidders is sufficiently large as a function
of an appropriate measure of complexity of the class of allowable pricing functions. We apply these results
to the problem of auctioning a digital good, to the attribute auction problem which includes a wide variety
of discriminatory pricing problems, and to the problem of item-pricing in unlimited-supply combinatorial
auctions. From a machine learning perspective, these settings present several challenges: in particular, the
loss function is discontinuous, is asymmetric, and has a large range.

The high level idea of our most basic reduction is based on the notion of a random sampling auction.
For concreteness, let us imagine we are selling a collection of n goods or services of zero marginal cost
to us, to n bidders who may have complex preference functions over these items, and our objective is to
achieve revenue comparable to the best possible assignment of prices to the various items we are selling.
So, technically speaking, we are in the setting of maximizing revenue in an unlimited supply combinatorial
auction. Then given a set of bids $S$, we perform the following operations. We first randomly partition $S$
into two sets $S_1$ and $S_2$. We then consider the purely algorithmic problem of finding the best set of prices
$p_1$ for the set of bids $S_1$ (which may be difficult but is purely algorithmic), and the best set of prices $p_2$
for the set of bids $S_2$. We then use $p_1$ as offer prices for bidders in $S_2$, giving each bidder the bundle
maximizing revealed valuation minus price, and use $p_2$ as offer prices for bidders in $S_1$. We then show
that even if bidders’ preferences are extremely complicated, this mechanism will achieve revenue close to
that of the best fixed assignment of prices to items so long as the number of bidders is sufficiently large
compared to the number of items for sale. For example, if all bidders’ valuations on the grand bundle of
all n items lie in the range $[1, h]$, then $O(hn/\epsilon^2)$ bidders are sufficient so that with high probability, we
come within a $(1 + \epsilon)$ factor of the optimal fixed item pricing. Or, if we cannot solve the algorithmic
problem exactly (since many problems of this form are often NP-hard [25, 26, 32, 129]), we lose only a
$(1 + \epsilon)$ factor over whatever approximation our method for solving the algorithmic problem gives us.
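
The random sampling auction can be sketched in its simplest special case: a single digital good in unlimited supply with zero marginal cost, where a "set of prices" is just one posted price. This is an illustrative simplification of the combinatorial setting described above, not the general mechanism; the function names and the example bids are hypothetical.

```python
import random

def best_fixed_price(bids):
    """Purely algorithmic step: the single price maximizing
    price * (number of bids >= price). Returns (price, revenue)."""
    best = (0, 0)
    for price in set(bids):
        revenue = price * sum(1 for b in bids if b >= price)
        if revenue > best[1]:
            best = (price, revenue)
    return best

def random_sampling_auction(bids, rng):
    """Randomly split the bidders into S1 and S2, then offer each half the
    price that is optimal for the *other* half."""
    s1, s2 = [], []
    for i in range(len(bids)):
        (s1 if rng.random() < 0.5 else s2).append(i)
    p1, _ = best_fixed_price([bids[i] for i in s1])
    p2, _ = best_fixed_price([bids[i] for i in s2])
    offers = {i: p2 for i in s1}             # bidders in S1 are offered p2
    offers.update({i: p1 for i in s2})       # bidders in S2 are offered p1
    # A bidder buys (and pays the offered price) iff her bid is at least the offer.
    revenue = sum(offers[i] for i in offers if bids[i] >= offers[i])
    return offers, revenue

bids = [1, 1, 2, 3, 5, 8, 8, 10]
offers, revenue = random_sampling_auction(bids, random.Random(1))

# Bidder 0's offer is computed from the other half only, so changing her own
# bid (with the same random partition) leaves her offered price unchanged.
changed = list(bids)
changed[0] = 100
offers_after, _ = random_sampling_auction(changed, random.Random(1))
```

In this sketch the price offered to a bidder never depends on her own bid, which is the incentive-compatibility property discussed above, and the revenue obtained never exceeds that of the best single fixed price on all bids. How close it comes to that benchmark as the number of bidders grows is what the sample-complexity analysis quantifies.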




More  generally,  these  methods  apply  to  a  wide  variety  of  pricing  problems,  including  those  in  which 
bidders  have  both  public  and  private  information,  and  also  give  a  formal  framework  in  which  one  can 
address  other  interesting  design  issues  such  as  how  fine-grained  a  market  segmentation  should  be.  This 
framework  provides  a  unified  approach  to  considering  a  variety  of  profit  maximizing  mechanism  design 
problems including many that have been previously considered in the literature. Furthermore, our results
substantially generalize the previous work on random sampling mechanisms by both broadening the
applicability  of  such  mechanisms  and  by  simplifying  the  analysis. 

Some  of  our  techniques  give  suggestions  for  the  design  of  mechanisms  and  others  for  their  analysis. 
In  terms  of  design,  these  include  the  use  of  discretization  to  produce  smaller  function  classes,  and  the  use 
of  structural-risk  minimization  to  choose  an  appropriate  level  of  complexity  of  the  mechanism  for  a  given 
set  of  bidders.  In  terms  of  analysis,  these  include  both  the  use  of  basic  sample-complexity  arguments,  and 
the  notion  of  multiplicative  covers  for  better  bounding  the  true  complexity  of  a  given  class  of  offers. 

Finally,  from  a  learning  perspective,  this  mechanism-design  setting  presents  a  number  of  technical 
challenges  when  attempting  to  get  good  bounds:  in  particular,  the  payoff  function  is  discontinuous  and 
asymmetric,  and  the  payoffs  for  different  offers  are  non-uniform.  For  example,  we  develop  bounds  based 
on  a  different  notion  of  covering  number  than  typically  used  in  machine  learning,  in  order  to  obtain  results 
that are more meaningful for this mechanism design setting. We describe these results in Chapter 7.


1.2  Summary  of  the  Main  Results  and  Bibliographic  Information 

This  thesis  is  organized  as  follows. 

•  In  Chapter  2  we  present  the  first  general  discriminative  model  for  Semi-Supervised  learning.  In  this 
model  we  provide  a  variety  of  algorithmic  and  sample  complexity  results  and  we  also  show  how  it 
can  be  used  to  reason  about  many  of  the  different  semi-supervised  learning  approaches  taken  over 
the past decade in the machine learning community. Much of this chapter is based on work that
appears in [23], [27]. Other related work we have done on Co-training (which we briefly mention)
appears in [28].

•  In  Chapter  3  we  provide  a  theory  of  learning  with  general  similarity  functions  (that  is,  functions 
which  are  not  necessarily  legal  kernels).  This  theory  provides  conditions  on  the  suitability  of  a 
similarity  function  for  a  given  learning  problem  in  terms  of  more  tangible  and  more  operational 
quantities  than  those  used  by  the  standard  theory  of  kernel  functions.  In  addition  to  being  provably 
more  general  than  the  standard  theory,  our  framework  provides  the  first  rigorous  explanation  for 
why  a  kernel  function  that  is  good  in  the  large-margin  sense  can  also  formally  be  viewed  as  a  good 
measure  of  similarity,  thereby  giving  formal  justification  to  a  common  intuition  about  kernels.  In 
this  chapter  we  analyze  both  algorithmic  and  sample  complexity  issues,  and  this  is  mostly  based  on 
work that appears in [24], [38], and [39].

•  In  Chapter  4  we  study  Clustering  and  we  present  the  first  general  framework  for  analyzing  clustering 
accuracy  without  probabilistic  assumptions.  Again,  in  this  chapter  we  consider  both  algorithmic  and 
information  theoretic  aspects.  This  is  mainly  based  on  work  that  appears  in  [40],  but  also  includes 
parts from the recent work in [42].

• In Chapter 5 we analyze Active Learning and present two main results. In Section 5.1 we provide a
generic active learning algorithm that works in the presence of arbitrary forms of noise. This section
is focused mostly on sample complexity aspects and the main contribution here is to provide the
first positive result showing that active learning can provide a significant improvement over passive
learning even in the presence of arbitrary forms of noise. In Section 5.2 we analyze a natural
margin-based active learning strategy for learning linear separators (which queries points near the
hypothesized decision boundary). We provide a detailed analysis (both sample complexity and
algorithmic) both in the realizable case and in a specific noisy setting related to the Tsybakov noise
condition. This chapter is based on work that appears in [30], [35], and [33]. We also briefly
mention other recent work on the topic [41].

•  In  Chapter  6  we  present  additional  results  on  learning  with  kernel  functions.  Specifically,  we  show 
how  Random  Projection  techniques  can  be  used  to  “demystify”  kernel  functions.  We  show  that 
in  the  presence  of  a  large  margin,  a  kernel  can  be  efficiently  converted  into  a  mapping  to  a  low 
dimensional space; in particular, we present a computationally efficient procedure that, given black-box
access to the kernel and unlabeled data, generates a small number of features that approximately
preserve both separability and margin. This is mainly based on work that appears in [31].

•  In  Chapter  7  we  show  how  model  selection  and  sample  complexity  techniques  in  machine  learning 
can  be  used  to  convert  difficult  mechanism  design  problems  to  more  standard  algorithmic  questions 
for  a  wide  range  of  pricing  problems.  We  present  a  unified  approach  for  considering  a  variety  of 
profit  maximizing  mechanism  design  problems,  such  as  the  problem  of  auctioning  a  digital  good,  the 
attribute  auction  problem  (which  includes  many  discriminatory  pricing  problems),  and  the  problem 
of  item  pricing  in  unlimited  supply  combinatorial  auctions.  These  results  substantially  generalize 
the  previous  work  on  random  sampling  mechanisms  by  both  broadening  the  applicability  of  such 
mechanisms  (e.g.,  to  multi-parameter  settings),  and  by  simplifying  and  refining  the  analysis.  This 
chapter is mainly based on work that appears in [29] and [36] and it is focused on using machine
learning techniques for providing a generic reduction from the incentive-compatible mechanism
design question to more standard algorithmic questions, without attempting to address the
algorithmic questions themselves. In other related work (which for coherence and space limitations is
not included in this thesis) we have also considered various algorithmic problems that arise in this
context [26], [25], [32] and [37].

While  we  discuss  both  technical  and  conceptual  connections  between  the  various  learning  protocols 
and  paradigms  studied  throughout  the  thesis,  each  chapter  can  also  be  read  somewhat  independently. 




Chapter  2 

A  Discriminative  Framework  for 
Semi-Supervised  Learning 


There  has  recently  been  substantial  interest  in  semi-supervised  learning  —  a  paradigm  for  incorporating 
unlabeled  data  in  the  learning  process  —  since  any  useful  information  that  reduces  the  amount  of  labeled 
data  needed  for  learning  can  be  a  significant  benefit.  Several  techniques  have  been  developed  for  doing 
this, along with experimental results on a variety of different learning problems. Unfortunately, the
standard learning frameworks for reasoning about supervised learning do not capture the key aspects and the
assumptions underlying these semi-supervised learning methods.

In  this  chapter  we  describe  an  augmented  version  of  the  PAC  model  designed  for  semi-supervised 
learning,  that  can  be  used  to  reason  about  many  of  the  different  approaches  taken  over  the  past  decade  in 
the  Machine  Learning  community.  This  model  provides  a  unified  framework  for  analyzing  when  and  why 
unlabeled data can help in the semi-supervised learning setting, in which one can analyze both
sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model
where, in addition to a concept class $C$, one also proposes a compatibility notion: a type of compatibility
that one believes the target concept should have with the underlying distribution of data. Unlabeled data
is then potentially helpful in this setting because it allows one to estimate compatibility over the space of
hypotheses, and to reduce the size of the search space from the whole set of hypotheses $C$ down to those
that, according to one’s assumptions, are a-priori reasonable with respect to the distribution. As we show,
many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in
this framework.

After proposing the model, we then analyze sample-complexity issues in this setting: that is, how
much of each type of data one should expect to need in order to learn well, and what the key quantities are
that these numbers depend on. Our work is the first to address such important questions in the context of
semi-supervised learning in a unified way. We also consider the algorithmic question of how to efficiently
optimize for natural classes and compatibility notions, and provide several algorithmic results including
an improved bound for Co-Training with linear separators when the distribution satisfies independence
given the label.

2.1  Introduction 

As mentioned in Chapter 1, given the easy availability of unlabeled data in many settings, there has been
growing interest in methods that try to use such data together with the (more expensive) labeled data
for learning. In particular, a number of semi-supervised learning techniques have been developed for
doing this, along with experimental results on a variety of different learning problems. These include
label propagation for word-sense disambiguation [210], co-training for classifying web pages [62] and
improving visual detectors [159], transductive SVM [141] and EM [176] for text classification,
graph-based methods [215], and others. The problem of learning from labeled and unlabeled data has been the
topic of several ICML workshops [15, 117] as well as a recent book [82] and survey article [214].

What makes unlabeled data so useful, and what many of these methods exploit, is that for a wide variety
of learning problems, the natural regularities of the problem involve not only the form of the function being
learned but also how this function relates to the distribution of data. For example, in many problems one
might expect that the target function should cut through low density regions of the space, a property used by
the transductive SVM algorithm [141]. In other problems one might expect the target to be self-consistent
in some way, a property used by Co-training [62]. Unlabeled data is potentially useful in these settings
because it then allows one to reduce the search space to a set which is a-priori reasonable with respect to
the underlying distribution.

Unfortunately,  however,  the  underlying  assumptions  of  these  semi-supervised  learning  methods  are 
not captured well by standard theoretical models. The main goal of this chapter is to propose a unified
theoretical framework for semi-supervised learning, in which one can analyze when and why unlabeled data
can  help,  and  in  which  one  can  discuss  both  sample-complexity  and  algorithmic  issues  in  a  discriminative 
(PAC-model  style)  framework. 

One  difficulty  from  a  theoretical  point  of  view  is  that  standard  discriminative  learning  models  do  not 
allow  one  to  specify  relations  that  one  believes  the  target  should  have  with  the  underlying  distribution. 
In particular, both in the PAC model [69, 149, 201] and the Statistical Learning Theory framework [203]
there is purposefully a complete disconnect between the data distribution $D$ and the target function $f$
being learned. The only prior belief is that $f$ belongs to some class $C$: even if the data distribution $D$ is
known fully, any function $f \in C$ is still possible. For instance, in the PAC model, it is perfectly natural
(and common) to talk about the problem of learning a concept class such as DNF formulas [162, 206]
or an intersection of halfspaces [47, 61, 153, 204] over the uniform distribution; but clearly in this case
unlabeled data is useless: you can just generate it yourself. For learning over an unknown distribution,
unlabeled data can help somewhat in the standard models (e.g., by allowing one to use distribution-specific
algorithms and sample-complexity bounds [53, 144]), but this does not seem to capture the power of
unlabeled  data  in  practical  semi-supervised  learning  methods. 

In generative models, one can easily talk theoretically about the use of unlabeled data, e.g., [76, 77].
However,  these  results  typically  make  strong  assumptions  that  essentially  imply  that  there  is  only  one 
natural  distinction  to  be  made  for  a  given  (unlabeled)  data  distribution.  For  instance,  a  typical  generative 
model  would  be  that  we  assume  positive  examples  are  generated  by  one  Gaussian,  and  negative  examples 
are  generated  by  another  Gaussian.  In  this  case,  given  enough  unlabeled  data,  we  could  in  principle 
recover  the  Gaussians  and  would  need  labeled  data  only  to  tell  us  which  Gaussian  is  the  positive  one  and 
which is the negative one.¹ However, this is too strong an assumption for most real-world settings. Instead,
we  would  like  our  model  to  allow  for  a  distribution  over  data  (e.g.,  documents  we  want  to  classify)  where 
there  are  a  number  of  plausible  distinctions  we  might  want  to  make.  In  addition,  we  would  like  a  general 
framework  that  can  be  used  to  model  many  different  uses  of  unlabeled  data. 

2.1.1  Our  Contribution 

In this chapter, we present a discriminative (PAC-style) framework that bridges between these positions
and can be used to help think about and analyze many of the ways unlabeled data is typically used. This
framework extends the PAC learning model in a way that allows one to express not only the form of
target function one is considering, but also relationships that one hopes the target function and underlying
distribution will possess. We then analyze both sample-complexity issues (that is, how much of each
type of data one should expect to need in order to learn well) and algorithmic results in this model.
We derive bounds for both the realizable (PAC) and agnostic (statistical learning framework) settings.

¹ [76, 77] do not assume Gaussians in particular, but they do assume the distributions are distinguishable, which from this
perspective has the same issue.

Specifically,  the  idea  of  the  proposed  model  is  to  augment  the  PAC  notion  of  a  concept  class,  which 
is  a  set  of  functions  (such  as  linear  separators  or  decision  trees),  with  a  notion  of  compatibility  between 
a  function  and  the  data  distribution  that  we  hope  the  target  function  will  satisfy.  Rather  than  talking  of 
“learning a concept class $C$,” we will talk of “learning a concept class $C$ under compatibility notion $\chi$.”
For  example,  suppose  we  believe  there  should  exist  a  low-error  linear  separator,  and  that  furthermore,  if 
the  data  happens  to  cluster,  then  this  separator  does  not  slice  through  the  middle  of  any  such  clusters.  Then 
we  would  want  a  compatibility  notion  that  penalizes  functions  that  do,  in  fact,  slice  through  clusters.  In 
this  framework,  the  ability  of  unlabeled  data  to  help  depends  on  two  quantities:  first,  the  extent  to  which 
the  target  function  indeed  satisfies  the  given  assumptions,  and  second,  the  extent  to  which  the  distribution 
allows  this  assumption  to  rule  out  alternative  hypotheses.  For  instance,  if  the  data  does  not  cluster  at 
all  (say  the  underlying  distribution  is  uniform  in  a  ball),  then  all  functions  would  equally  satisfy  this 
compatibility  notion  and  the  assumption  is  not  useful.  From  a  Bayesian  perspective,  one  can  think  of  this 
as  a  PAC  model  for  a  setting  in  which  one’s  prior  is  not  just  over  functions,  but  also  over  how  the  function 
and  underlying  distribution  relate  to  each  other. 

To make our model formal, we will need to ensure that the degree of compatibility be something that
can be estimated from a finite sample. To do this, we will require that the compatibility notion $\chi$ in fact
be a function from $C \times X$ to $[0, 1]$, where the compatibility of a hypothesis $h$ with the data distribution
$D$ is then $E_{x \sim D}[\chi(h, x)]$. That is, we require that the degree of incompatibility be a kind of unlabeled
loss function, and the incompatibility of a hypothesis $h$ with a data distribution $D$ is a quantity we can
think of as an “unlabeled error rate” that measures how a-priori unreasonable we believe some proposed
hypothesis to be. For instance, in the example above of a “margin-style” compatibility, we could define
$\chi(f, x)$ to be an increasing function of the distance of $x$ to the separator $f$. In this case, the unlabeled error
rate, $1 - \chi(f, D)$, is a measure of the probability mass close to the proposed separator. In co-training,
where each example $x$ has two “views” ($x = (x_1, x_2)$), the underlying belief is that the true target $c^*$ can
be decomposed into functions $(c_1^*, c_2^*)$ over each view such that for most examples, $c_1^*(x_1) = c_2^*(x_2)$. In
this case, we can define $\chi((f_1, f_2), (x_1, x_2)) = 1$ if $f_1(x_1) = f_2(x_2)$, and $0$ if $f_1(x_1) \neq f_2(x_2)$. Then the
compatibility of a hypothesis $(f_1, f_2)$ with an underlying distribution $D$ is $\Pr_{(x_1, x_2) \sim D}[f_1(x_1) = f_2(x_2)]$.
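
As a small illustration of why such a compatibility notion can be estimated from unlabeled data alone, the following Python sketch computes the empirical agreement rate of a pair of view-specific classifiers on an unlabeled two-view sample. The function name, the toy classifiers, and the toy data are all hypothetical and chosen only for this example.

```python
def agreement_compatibility(f1, f2, unlabeled_pairs):
    """Fraction of unlabeled two-view examples (x1, x2) with f1(x1) == f2(x2).

    This is the empirical counterpart of Pr[f1(x1) = f2(x2)]; no labels are used.
    """
    if not unlabeled_pairs:
        raise ValueError("need at least one unlabeled example")
    agree = sum(1 for x1, x2 in unlabeled_pairs if f1(x1) == f2(x2))
    return agree / len(unlabeled_pairs)

# Hypothetical toy views: x1 is a number, x2 is a word.
f1 = lambda x1: int(x1 > 0)
f2 = lambda x2: int(x2 == "pos")
sample = [(1.5, "pos"), (-2.0, "neg"), (0.3, "neg"), (-1.0, "neg")]

compat = agreement_compatibility(f1, f2, sample)  # the two views agree on 3 of 4 examples
unlabeled_error = 1.0 - compat                    # empirical "unlabeled error rate"
```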

This  framework  allows  us  to  analyze  the  ability  of  a  finite  unlabeled  sample  to  reduce  our  dependence 
on  labeled  examples,  as  a  function  of  (1)  the  compatibility  of  the  target  function  (i.e.,  how  correct  we  were 
in  our  assumption)  and  (2)  various  measures  of  the  “helpfulness”  of  the  distribution.  In  particular,  in  our 
model,  we  find  that  unlabeled  data  can  help  in  several  distinct  ways. 

• If the target function is highly compatible with $D$ and belongs to $C$, then if we have enough
unlabeled data to estimate compatibility over all $f \in C$, we can in principle reduce the size of the search
space from $C$ down to just those $f \in C$ whose estimated compatibility is high. For instance, if $D$ is
“helpful”, then the set of such functions will be much smaller than the entire set $C$. In the agnostic
case we can do (unlabeled)-data-dependent structural risk minimization to trade off labeled error
and incompatibility.

• By providing an estimate of $D$, unlabeled data can allow us to use a more refined
distribution-specific notion of “hypothesis space size” such as Annealed VC-entropy [103], Rademacher
complexities [43, 12, 155] or the size of the smallest $\epsilon$-cover [53], rather than VC-dimension [69, 149].
In fact, for many natural notions of compatibility we find that the sense in which unlabeled data
reduces the “size” of the search space is best described in these distribution-specific measures.

• Finally, if the distribution is especially helpful, we may find that not only does the set of compatible
$f \in C$ have a small $\epsilon$-cover, but also the elements of the cover are far apart. In that case, if we
assume the target function is fully compatible, we may be able to learn from even fewer labeled
examples than the $O(1/\epsilon)$ needed just to verify a good hypothesis. For instance, as one application
of this, we show that under the assumption of independence given the label, one can efficiently
perform Co-Training of linear separators from a single labeled example!

Our framework also allows us to address the issue of how much unlabeled data we should expect to
need. Roughly, the “VCdim/$\epsilon^2$” form of standard sample complexity bounds now becomes a bound on the
number of unlabeled examples we need to uniformly estimate compatibilities. However, technically, the
set whose VC-dimension we now care about is not $C$ but rather a set defined by both $C$ and $\chi$: that is, the
overall complexity depends both on the complexity of $C$ and the complexity of the notion of compatibility
(see Section 2.3.1). One consequence of our model is that if the target function and data distribution are
both well behaved with respect to the compatibility notion, then the sample-size bounds we get for labeled
data can substantially beat what one could hope to achieve through pure labeled-data bounds, and we
illustrate this with a number of examples through the chapter.

2.1.2  Summary  of  Main  Results 

The  primary  contributions  of  this  chapter  are  the  following.  First,  as  described  above,  we  develop  a 
new  discriminative  (PAC-style)  model  for  semi-supervised  learning,  that  can  be  used  to  analyze  when 
unlabeled  data  can  help  and  how  much  unlabeled  data  is  needed  in  order  to  gain  its  benefits,  as  well  as 
the  algorithmic  problems  involved.  Second,  we  present  a  number  of  sample-complexity  bounds  in  this 
framework, both in terms of uniform-convergence results (which apply to any algorithm that is able to
find rules of low error and high compatibility) and $\epsilon$-cover-based bounds that apply to a more
restricted class of algorithms but can be substantially tighter. For instance, we describe several natural
cases in which $\epsilon$-cover-based bounds can apply even though with high probability there still exist bad
hypotheses in the class consistent with the labeled and unlabeled examples. Finally, we present several
PAC-style algorithmic results in this model. Our main algorithmic result is a new algorithm for
Co-Training with linear separators that, if the distribution satisfies independence given the label, requires
only a single labeled example to learn to any desired error rate $\epsilon$ and is computationally efficient (i.e.,
achieves PAC guarantees). This substantially improves on the results of [62] which required enough
labeled examples to produce an initial weak hypothesis, and in the process we get a simplification to the
noisy halfspace learning algorithm of [64].

Our framework has helped analyze many of the existing semi-supervised learning methods used in
practice and has guided the development of new semi-supervised learning algorithms and analyses. We
discuss this further in Section 2.6.1.

2.1.3  Structure  of  this  Chapter 

We begin by describing the general setting in which our results apply as well as several examples to
illustrate our framework in Section 2.2. We then give results both for sample complexity (in principle, how
much data is needed to learn) and efficient algorithms. In terms of sample-complexity, we start by
discussing uniform convergence results in Section 2.3.1. For clarity we begin with the case of finite
hypothesis spaces in Section 2.3.1 and then discuss infinite hypothesis spaces in Section 2.3.1. These results
give bounds on the number of examples needed for any learning algorithm that produces a compatible
hypothesis of low empirical error. We also show how in the agnostic case we can do
(unlabeled)-data-dependent structural risk minimization to trade off labeled error and incompatibility in Section 2.3.1. To
achieve tighter bounds, in Section 2.3.2 we give results based on the notion of $\epsilon$-cover size. These bounds
hold only for algorithms of a specific type (that first use the unlabeled data to choose a small set of
“representative” hypotheses and then choose among the representatives based on the labeled data), but can yield
bounds substantially better than with uniform convergence (e.g., we can learn even though there exist bad
$h \in C$ consistent with the labeled and unlabeled examples).

In Section 2.4, we give our algorithmic results. We begin with a particularly simple class $C$ and
compatibility notion $\chi$ for illustration, and then give our main algorithmic result for Co-Training with linear
separators. In Section 2.5 we discuss a transductive analog of our model, connections with generative
models and other ways of using unlabeled data in machine learning, as well as the relationship between
our model and the Luckiness Framework [191] developed in the context of supervised learning. Finally,
in Section 2.6 we discuss some implications of our model and present our conclusions, as well as a number
of open problems.


2.2  A  Formal  Framework 

In  this  section  we  introduce  general  notation  and  terminology  we  use  throughout  the  chapter,  and  describe 
our  model  for  semi-supervised  learning.  In  particular,  we  formally  define  what  we  mean  by  a  notion  of 
compatibility  and  we  illustrate  it  through  a  number  of  examples  including  margins  and  co-training. 

We will focus on binary classification problems. We assume that our data comes according to a
fixed unknown distribution $D$ over an instance space $X$, and is labeled by some unknown target function
$c^* : X \to \{0, 1\}$. A learning algorithm is given a set $S_L$ of labeled examples drawn i.i.d. from $D$ and
labeled by $c^*$ as well as a (usually larger) set $S_U$ of unlabeled examples from $D$. The goal is to perform
some optimization over the samples $S_L$ and $S_U$ and to output a hypothesis that agrees with the target over
most of the distribution. In particular, the error rate (also called “0-1 loss”) of a given hypothesis $f$ is
defined as $err(f) = err_D(f) = \Pr_{x \sim D}[f(x) \neq c^*(x)]$. For any two hypotheses $f_1, f_2$, the distance with
respect to $D$ between $f_1$ and $f_2$ is defined as $d(f_1, f_2) = d_D(f_1, f_2) = \Pr_{x \sim D}[f_1(x) \neq f_2(x)]$. We will
use $\widehat{err}(f)$ to denote the empirical error rate of $f$ on a given labeled sample (i.e., the fraction of mistakes
on the sample) and $\hat{d}(f_1, f_2)$ to denote the empirical distance between $f_1$ and $f_2$ on a given unlabeled
sample (the fraction of the sample on which they disagree). As in the standard PAC model, a concept
class or hypothesis space is a set of functions over the instance space $X$. In the “realizable case”, we
make the assumption that the target is in a given class $C$, whereas in the “agnostic case” we do not make
this assumption and instead aim to compete with the best function in the given class $C$.
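
The two empirical quantities just defined can be sketched in a few lines of Python. The point of the example is that the empirical error needs labels while the empirical distance between two hypotheses needs only unlabeled points. The function names and the threshold classifiers on the real line are hypothetical choices made for this illustration.

```python
def empirical_error(f, labeled_sample):
    """Fraction of labeled examples (x, y) on which f(x) != y."""
    return sum(1 for x, y in labeled_sample if f(x) != y) / len(labeled_sample)

def empirical_distance(f1, f2, unlabeled_sample):
    """Fraction of unlabeled examples x on which f1(x) != f2(x)."""
    return sum(1 for x in unlabeled_sample if f1(x) != f2(x)) / len(unlabeled_sample)

# Hypothetical threshold classifiers on the real line.
f = lambda x: int(x >= 0.3)
g = lambda x: int(x >= 0.7)

# Labeled sample: labels come from an (assumed) target with threshold 0.5.
s_l = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
# Unlabeled sample: no labels are needed to compare f and g.
s_u = [0.1, 0.2, 0.4, 0.6, 0.8]

err_f = empirical_error(f, s_l)           # f errs only on x = 0.4
dist_fg = empirical_distance(f, g, s_u)   # f and g disagree on x = 0.4 and x = 0.6
```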

We  now  formally  describe  what  we  mean  by  a  notion  of  compatibility.  A  notion  of  compatibility  is  a 
mapping  from  a  hypothesis  /  and  a  distribution  D  to  [0, 1]  indicating  how  “compatible”  /  is  with  D.  In 
order  for  this  to  be  estimable  from  a  finite  sample,  we  require  that  compatibility  be  an  expectation  over 
individual examples.² Specifically, we define:

Definition 2.2.1 A legal notion of compatibility is a function $\chi : C \times X \to [0, 1]$ where we (overloading
notation) define $\chi(f, D) = E_{x \sim D}[\chi(f, x)]$. Given a sample $S$, we define $\chi(f, S)$ to be the empirical
average of $\chi$ over the sample.

² One could imagine more general notions of compatibility with the property that they can be estimated from a finite sample
and  all  our  results  would  go  through  in  that  case  as  well.  We  consider  the  special  case  where  the  compatibility  is  an  expectation 
over  individual  examples  for  simplicity  of  notation,  and  because  most  existing  semi-supervised  learning  algorithms  used  in 
practice  do  satisfy  it. 


Note 1 One could also allow compatibility functions over $k$-tuples of examples, in which case our
(unlabeled) sample-complexity bounds would simply increase by a factor of $k$. For settings in which $D$ is
actually known in advance (e.g., transductive learning, see Section 2.5.1) we can drop this requirement
entirely and allow any notion of compatibility $\chi(f, D)$ to be legal.

Definition 2.2.2 Given compatibility notion $\chi$, the incompatibility of $f$ with $D$ is $1 - \chi(f, D)$. We will
also call this its unlabeled error rate, $err_{unl}(f)$, when $\chi$ and $D$ are clear from context. For a given
sample $S$, we use $\widehat{err}_{unl}(f) = 1 - \chi(f, S)$ to denote the empirical average over $S$.

Finally, we need a notation for the set of functions whose incompatibility (or unlabeled error rate) is
at most some given value $\tau$.

Definition 2.2.3 Given value $\tau$, we define $C_{D,\chi}(\tau) = \{f \in C : err_{unl}(f) \le \tau\}$. So, e.g., $C_{D,\chi}(1) = C$.
Similarly, for a sample $S$, we define $C_{S,\chi}(\tau) = \{f \in C : \widehat{err}_{unl}(f) \le \tau\}$.
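
For a finite hypothesis class, the empirical set of compatible hypotheses can be computed directly from an unlabeled sample. The Python sketch below is a minimal illustration under assumptions of our own: hypotheses are represented as plain thresholds on the real line, and the compatibility function used in the example is a hypothetical one-dimensional margin-style notion with a fixed margin of 0.1.

```python
def empirical_unlabeled_error(chi, f, sample):
    """1 minus the empirical average of chi(f, x) over the unlabeled sample."""
    return 1.0 - sum(chi(f, x) for x in sample) / len(sample)

def compatible_subset(hypotheses, chi, sample, tau):
    """Hypotheses whose empirical unlabeled error on the sample is at most tau."""
    return [f for f in hypotheses
            if empirical_unlabeled_error(chi, f, sample) <= tau]

# Hypothetical setup: a hypothesis is a threshold t, and chi(t, x) = 1
# when x lies farther than 0.1 from t (so t does not cut through the data).
def chi(t, x):
    return 1.0 if abs(x - t) > 0.1 else 0.0

thresholds = [0.2, 0.5, 0.8]
# Unlabeled data forming two clusters, around 0.2 and around 0.8.
unlabeled = [0.15, 0.2, 0.25, 0.75, 0.8, 0.85]

pruned = compatible_subset(thresholds, chi, unlabeled, tau=0.1)
everything = compatible_subset(thresholds, chi, unlabeled, tau=1.0)
```

In this toy setup only the threshold lying between the two clusters survives a small value of tau, while tau = 1 returns the whole class, mirroring the remark in Definition 2.2.3.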

We  now  give  several  examples  to  illustrate  this  framework: 

Example 1. Suppose examples are points in $\mathbb{R}^d$ and $C$ is the class of linear separators. A natural belief
in this setting is that data should be “well-separated”: not only should the target function separate the
positive and negative examples, but it should do so by some reasonable margin $\gamma$. This is the assumption
used by Transductive SVM, also called Semi-Supervised SVM (S$^3$VM) [55, 81, 141]. In this case, if we
are given $\gamma$ up front, we could define $\chi(f, x) = 1$ if $x$ is farther than distance $\gamma$ from the hyperplane
defined by $f$, and $\chi(f, x) = 0$ otherwise. So, the incompatibility of $f$ with $D$ is the probability mass
within distance $\gamma$ of the hyperplane $f \cdot x = 0$. Alternatively, if we do not want to commit to a specific $\gamma$
in advance, we could define $\chi(f, x)$ to be a smooth function of the distance of $x$ to the separator, as done
in [81]. Note that in contrast, defining compatibility of a hypothesis based on the largest $\gamma$ such that $D$
has probability mass exactly zero within distance $\gamma$ of the separator would not fit our model: it cannot be
written as an expectation over individual examples and indeed would not be a good definition, since one
cannot distinguish “zero” from “exponentially close to zero” from a small sample of unlabeled data.
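
As a small illustration of Example 1, the following sketch computes the margin-based compatibility of a single point and the empirical unlabeled error rate of a separator over a sample. The function names, the toy sample, and the choice $\gamma = 0.5$ are assumptions made here for illustration; they do not come from the text.

```python
import math

def margin_compatibility(w, x, gamma):
    """chi(f, x): 1 if x is farther than gamma from the hyperplane w.x = 0, else 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    distance = abs(sum(wi * xi for wi, xi in zip(w, x))) / norm
    return 1 if distance > gamma else 0

def empirical_unlabeled_error(w, sample, gamma):
    """1 - chi(f, S): fraction of the sample lying within distance gamma of the hyperplane."""
    compatible = sum(margin_compatibility(w, x, gamma) for x in sample)
    return 1.0 - compatible / len(sample)

# Hypothetical unlabeled sample in the plane.
sample = [(2.0, 0.0), (-3.0, 1.0), (0.1, 5.0), (-0.2, -4.0)]

# Separator x1 = 0 passes close to two of the points; separator x2 = 0 passes close to one.
err_vertical = empirical_unlabeled_error((1.0, 0.0), sample, gamma=0.5)
err_horizontal = empirical_unlabeled_error((0.0, 1.0), sample, gamma=0.5)
print(err_vertical, err_horizontal)
```

Because the quantity is an average of a per-example function, it can be estimated from an unlabeled sample, which is exactly the property the “largest margin with zero mass” definition lacks.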

Example 2. In co-training [62], we assume examples $x$ each contain two “views”: $x = (x_1, x_2)$, and
our goal is to learn a pair of functions $(f_1, f_2)$, one on each view. For instance, if our goal is to classify web
pages, we might use $x_1$ to represent the words on the page itself and $x_2$ to represent the words attached
to links pointing to this page from other pages. The hope underlying co-training is that the two parts of
the example are generally consistent, which then allows the algorithm to bootstrap from unlabeled data.
For example, iterative co-training uses a small amount of labeled data to learn some initial information
(e.g., if a link with the words “my advisor” points to a page then that page is probably a faculty member’s
home page). Then, when it finds an unlabeled example where one side is confident (e.g., the link says “my
advisor”), it uses that to label the example for training over the other view. In regularized co-training,
one attempts to directly optimize a weighted combination of accuracy on labeled data and agreement over
unlabeled data. These approaches have been used for a variety of learning problems, including named
entity classification [87], text classification [116, 175], natural language processing [182], large scale
document classification [180], and visual detectors [159]. As mentioned in Section 2.1, the assumptions
underlying this method fit naturally into our framework. In particular, we can define the incompatibility of
some hypothesis $(f_1, f_2)$ with distribution $D$ as $\Pr_{(x_1,x_2)\sim D}[f_1(x_1) \neq f_2(x_2)]$. Similar notions are given
in subsequent work of [184, 196] for other types of learning problems (e.g., regression) and for other loss
functions.
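
The co-training incompatibility of Example 2 is a disagreement rate, and its empirical version over an unlabeled sample can be sketched as below. The two toy classifiers and the sample of (page text, anchor text) pairs are hypothetical and serve only to make the definition concrete.

```python
def cotraining_incompatibility(f1, f2, sample):
    """Empirical estimate of Pr[f1(x1) != f2(x2)] over a sample of two-view examples."""
    disagreements = sum(1 for x1, x2 in sample if f1(x1) != f2(x2))
    return disagreements / len(sample)

# Hypothetical per-view hypotheses: one reads the page text, the other the anchor text.
def f1(page_words):
    return 1 if "professor" in page_words else 0

def f2(anchor_words):
    return 1 if "advisor" in anchor_words else 0

# Hypothetical unlabeled examples (x1, x2); no labels are needed to measure agreement.
pages = [
    ("professor of computer science", "my advisor"),
    ("professor emeritus", "department news"),
    ("pizza menu", "lunch"),
    ("course project", "my advisor"),
]

rate = cotraining_incompatibility(f1, f2, pages)
print(rate)
```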

Example 3. In transductive graph-based methods, we are given a set of unlabeled examples connected
in a graph $g$, where the interpretation of an edge is that we believe the two endpoints of the edge should
have the same label. Given a few labeled vertices, various graph-based methods then attempt to use
them to infer labels for the remaining points. If we are willing to view $D$ as a distribution over edges
(a uniform distribution if $g$ is unweighted), then as in co-training we can define the incompatibility of
some hypothesis $f$ as the probability mass of edges that are cut by $f$, which then motivates various cut-based
algorithms. For instance, if we require $f$ to be boolean, then the mincut method of [59] finds the
most-compatible hypothesis consistent with the labeled data; if we allow $f$ to be fractional and define
$1 - \chi(f, \langle x_1, x_2 \rangle) = (f(x_1) - f(x_2))^2$, then the algorithm of [215] finds the most-compatible consistent
hypothesis. If we do not wish to view $D$ as a distribution over edges, we could have $D$ be a distribution
over vertices and broaden Definition 2.2.1 to allow for $\chi$ to be a function over pairs of examples. In fact, as
mentioned in Note 1, since we have perfect knowledge of $D$ in this setting we can allow any compatibility
function $\chi(f, D)$ to be legal. We discuss more connections with graph-based methods in Section 2.5.1.
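
The two edge-based incompatibility notions of Example 3 can be sketched as follows. Treating edge weights as an unnormalized distribution over edges is an assumption made here (the text only specifies the uniform distribution for an unweighted graph), and the graph, labels, and function names are hypothetical.

```python
def cut_incompatibility(labels, weighted_edges):
    """Probability mass of edges cut by a boolean labeling (edge weights define D)."""
    total = sum(w for _, _, w in weighted_edges)
    cut = sum(w for u, v, w in weighted_edges if labels[u] != labels[v])
    return cut / total

def quadratic_incompatibility(values, weighted_edges):
    """Expected value of (f(x1) - f(x2))^2 over edges, for a fractional labeling f."""
    total = sum(w for _, _, w in weighted_edges)
    penalty = sum(w * (values[u] - values[v]) ** 2 for u, v, w in weighted_edges)
    return penalty / total

# Hypothetical path graph a - b - c - d; the last edge carries twice the weight.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 2.0)]

boolean_f = {"a": 1, "b": 1, "c": 0, "d": 0}             # cuts only edge (b, c)
fractional_f = {"a": 1.0, "b": 0.5, "c": 0.5, "d": 0.0}  # spreads the change over two edges

cut_rate = cut_incompatibility(boolean_f, edges)
quad_rate = quadratic_incompatibility(fractional_f, edges)
print(cut_rate, quad_rate)
```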

Example 4. As a special case of co-training, suppose examples are pairs of points in $\mathbb{R}^d$, $C$ is the
class of linear separators, and we believe the two points in each pair should both be on the same side of
the target function. (So, this is a version of co-training where we require $f_1 = f_2$.) The motivation is that
we want to use pairwise information as in Example 3, but we also want to use the features of each data
point. For instance, in the word-sense disambiguation problem studied by [210], the goal is to determine
which of several dictionary definitions is intended for some target word in a piece of text (e.g., is “plant”
being used to indicate a tree or a factory?). The local context around each word can be viewed as placing
it into $\mathbb{R}^d$, but the edges correspond to a completely different type of information: the belief that if a word
appears twice in the same document, it is probably being used in the same sense both times. In this setting,
we could use the same compatibility function as in Example 3, but rather than having the concept class $C$
be all possible functions, we restrict $C$ to just linear separators.

Example 5. In a related setting to co-training considered by [158], examples are single points in $X$
but we have a pair of hypothesis spaces $(C_1, C_2)$ (or more generally a $k$-tuple $(C_1, \ldots, C_k)$), and the goal
is to find a pair of hypotheses $(f_1, f_2) \in C_1 \times C_2$ with low error over labeled data and that agree over the
distribution. For instance, if data is sufficiently “well-separated”, one might expect there to exist both a
good linear separator and a good decision tree, and one would like to use this assumption to reduce the need
for labeled data. In this case one could define compatibility of $(f_1, f_2)$ with $D$ as $\Pr_{x \sim D}[f_1(x) = f_2(x)]$,
or the similar notions given in [158, 189].

2.3  Sample  Complexity  Results 

We now present several sample-complexity bounds that can be derived in this framework, showing how
unlabeled data, together with a suitable compatibility notion, can reduce the need for labeled examples. We
do not focus on giving the tightest possible bounds, but instead on the types of bounds and the quantities
on which they depend, in order to better understand what it is about the learning problem that one can hope
to leverage with unlabeled data.

The high-level structure of all of these results is as follows. First, given enough unlabeled data (where
“enough” will be a function of some measure of the complexity of $C$ and possibly of $\chi$ as well), we can
uniformly estimate the true compatibilities of all functions in $C$ using their empirical compatibilities over
the sample. Then, by using this quantity to give a preference ordering over the functions in $C$, in the
realizable case we can reduce “$C$” down to “the set of functions in $C$ whose incompatibility is not much
larger than that of the true target function” in bounds for the number of labeled examples needed for learning. In
the agnostic case we can do (unlabeled)-data-dependent structural risk minimization to trade off labeled
error and incompatibility. The specific bounds differ in terms of the exact complexity measures used (and a
few other issues) and we provide examples illustrating when and how certain complexity measures can be
significantly more powerful than others. Moreover, one can prove fallback properties of these procedures:
the number of labeled examples required is never much worse than the number of labeled examples
required  by  a  standard  supervised  learning  algorithm.  However,  if  the  assumptions  happen  to  be  right,  one 
can  significantly  benefit  by  using  the  unlabeled  data. 

2.3.1  Uniform  Convergence  Bounds 

We begin with uniform convergence bounds (later in Section 2.3.2 we give tighter $\epsilon$-cover bounds that
apply to algorithms of a particular form). For clarity, we begin with the case of finite hypothesis spaces
where we measure the “size” of a set of functions by just the number of functions in the set. We then
discuss several issues that arise when considering infinite hypothesis spaces, such as what is an appropriate
measure for the “size” of the set of compatible functions, and the need to account for the complexity of the
compatibility notion itself. Note that in the standard PAC model, one typically talks of either the realizable
case, where we assume that the target function $c^*$ belongs to $C$, or the agnostic case where we allow any
target function $c^*$ [149]. In our setting, we have the additional issue of unlabeled error rate, and can either
make an a-priori assumption that the target function’s unlabeled error is low, or else provide a bound in
which our sample size (or error rate) depends on whatever its unlabeled error happens to be. We begin in
Section 2.3.1 (finite and then infinite hypothesis spaces) with bounds for the setting in which we assume
$c^* \in C$, and then (also in Section 2.3.1) we consider the agnostic case where we remove this assumption.


Finite  hypothesis  spaces 


We first give a bound for the “doubly realizable” case where we assume $c^* \in C$ and $err_{unl}(c^*) = 0$.

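Theorem 2.3.1 If $c^* \in C$ and $err_{unl}(c^*) = 0$, then $m_u$ unlabeled examples and $m_l$ labeled examples are
sufficient to learn to error $\epsilon$ with probability $1 - \delta$, where

$$m_u = \frac{1}{\epsilon}\left[\ln|C| + \ln\frac{2}{\delta}\right] \quad\text{and}\quad m_l = \frac{1}{\epsilon}\left[\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\right].$$

In particular, with probability at least $1 - \delta$, all $f \in C$ with $\widehat{err}(f) = 0$ and $\widehat{err}_{unl}(f) = 0$ have
$err(f) \le \epsilon$.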

Proof: The probability that a given hypothesis $f$ with $err_{unl}(f) > \epsilon$ has $\widehat{err}_{unl}(f) = 0$ is at most
$(1 - \epsilon)^{m_u} < \frac{\delta}{2|C|}$ for the given value of $m_u$. Therefore, by the union bound, the number of unlabeled
examples is sufficient to ensure that with probability $1 - \delta/2$, only hypotheses in $C_{D,\chi}(\epsilon)$ have
$\widehat{err}_{unl}(f) = 0$. The number of labeled examples then similarly ensures that with probability $1 - \delta/2$, none
of those whose true error is at least $\epsilon$ have an empirical error of 0, yielding the theorem. ■
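
To get a feel for the two quantities in Theorem 2.3.1, the sketch below evaluates the bounds numerically. The function name and the example sizes ($|C| = 2^{20}$, four surviving compatible hypotheses, $\epsilon = 0.1$, $\delta = 0.05$) are illustrative assumptions, not values from the text.

```python
import math

def doubly_realizable_sample_sizes(class_size, compatible_size, eps, delta):
    """Evaluate m_u and m_l from Theorem 2.3.1, rounded up to whole examples.

    class_size      -- |C|
    compatible_size -- |C_{D,chi}(eps)|, the number of hypotheses with unlabeled error <= eps
    """
    m_u = math.ceil((math.log(class_size) + math.log(2 / delta)) / eps)
    m_l = math.ceil((math.log(compatible_size) + math.log(2 / delta)) / eps)
    return m_u, m_l

# Hypothetical setting: a large class, but only four hypotheses are compatible with D.
m_u, m_l = doubly_realizable_sample_sizes(
    class_size=2 ** 20, compatible_size=4, eps=0.1, delta=0.05
)
print(m_u, m_l)
```

The labeled requirement depends on $|C|$ only through the size of the compatible subset, so a helpful distribution shows up directly as a smaller $m_l$.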

Interpretation: If the target function indeed is perfectly correct and compatible, then Theorem 2.3.1
gives sufficient conditions on the number of examples needed to ensure that an algorithm that optimizes
both quantities over the observed data will, in fact, achieve a PAC guarantee. To emphasize this, we will
say that an algorithm efficiently $PAC_{unl}$-learns the pair $(C, \chi)$ if it is able to achieve a PAC guarantee
using time and sample sizes polynomial in the bounds of Theorem 2.3.1. For a formal definition see
Definition 2.3.1 at the end of this section.

We can think of Theorem 2.3.1 as bounding the number of labeled examples we need as a function of
the “helpfulness” of the distribution $D$ with respect to our notion of compatibility. That is, in our context,
a helpful distribution is one in which $C_{D,\chi}(\epsilon)$ is small, and so we do not need much labeled data to identify
a good function among them. We can get a similar bound in the situation when the target function is not
fully compatible:

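Theorem 2.3.2 If $c^* \in C$ and $err_{unl}(c^*) = t$, then $m_u$ unlabeled examples and $m_l$ labeled examples are
sufficient to learn to error $\epsilon$ with probability $1 - \delta$, for

$$m_u = \frac{2}{\epsilon^2}\left[\ln|C| + \ln\frac{4}{\delta}\right] \quad\text{and}\quad m_l = \frac{1}{\epsilon}\left[\ln|C_{D,\chi}(t + 2\epsilon)| + \ln\frac{2}{\delta}\right].$$

In particular, with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$
has $err(f) \le \epsilon$.

Alternatively, given the above number of unlabeled examples $m_u$, for any number of labeled examples
$m_l$, with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$ has

$$err(f) \le \frac{1}{m_l}\left[\ln|C_{D,\chi}(err_{unl}(c^*) + 2\epsilon)| + \ln\frac{2}{\delta}\right]. \qquad (2.1)$$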


Proof: By Hoeffding bounds, $m_u$ is sufficiently large so that with probability at least $1 - \delta/2$, all $f \in C$
have $|\widehat{err}_{unl}(f) - err_{unl}(f)| \le \epsilon$. Thus, $\{f \in C : \widehat{err}_{unl}(f) \le t + \epsilon\} \subseteq C_{D,\chi}(t + 2\epsilon)$. For the first
implication, the given bound on $m_l$ is sufficient so that with probability at least $1 - \delta$, all $f \in C$ with
$\widehat{err}(f) = 0$ and $\widehat{err}_{unl}(f) \le t + \epsilon$ have $err(f) \le \epsilon$; furthermore, $\widehat{err}_{unl}(c^*) \le t + \epsilon$, so such a function $f$
exists. Therefore, with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$
has $err(f) \le \epsilon$, as desired. For the second implication, inequality (2.1) follows immediately by solving for
the labeled estimation-error as a function of $m_l$. ■
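
The Hoeffding step of this proof can be spelled out as follows. This is a sketch using the standard two-sided Hoeffding inequality together with a union bound over $C$; the threshold on $m_u$ shown is the smallest one this particular argument needs, so any larger unlabeled sample size also suffices.

```latex
\Pr\Big[\exists f \in C:\ \big|\widehat{err}_{unl}(f) - err_{unl}(f)\big| > \epsilon\Big]
  \;\le\; \sum_{f \in C} 2 e^{-2 m_u \epsilon^2}
  \;=\; 2|C|\, e^{-2 m_u \epsilon^2}
  \;\le\; \frac{\delta}{2}
\qquad\text{whenever}\qquad
m_u \;\ge\; \frac{1}{2\epsilon^2}\Big[\ln|C| + \ln\frac{4}{\delta}\Big].

% On this event, empirical compatibility within t + \epsilon implies true compatibility within t + 2\epsilon:
\widehat{err}_{unl}(f) \le t + \epsilon
\;\Longrightarrow\;
err_{unl}(f) \le \widehat{err}_{unl}(f) + \epsilon \le t + 2\epsilon .
```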


Interpretation:  Theorem  2.3.2  has  several  implications.  Specifically: 

1. If we can optimize the (empirical) unlabeled error rate subject to having zero empirical labeled
error, then to achieve low true error it suffices to draw a number of labeled examples that depends
logarithmically on the number of functions in $C$ whose unlabeled error rate is at most $2\epsilon$ greater
than that of the target $c^*$.

2. Alternatively, for any given number of labeled examples $m_l$, we can provide a bound (given in
equation 2.1) on our error rate that again depends logarithmically on the number of such functions,
i.e., with high probability the function $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$ has
$err(f) \le \frac{1}{m_l}\left[\ln|C_{D,\chi}(err_{unl}(c^*) + 2\epsilon)| + \ln\frac{2}{\delta}\right]$.

3. If we have a desired maximum error rate $\epsilon$ and do not know the value of $err_{unl}(c^*)$ but have the
ability to draw additional labeled examples as needed, then we can simply do a standard “doubling
trick” on $m_l$. On each round, we check if the hypothesis $f$ found indeed has sufficiently low
empirical unlabeled error rate, and we spread the $\delta$ parameter across the different runs. See, e.g.,
Corollary 2.3.6 in Section 2.3.1.
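
The selection rule analyzed in Theorem 2.3.2 (among hypotheses with zero empirical labeled error, pick one with lowest empirical unlabeled error) can be sketched for a small finite class as below. The threshold class, the margin-style compatibility with $\gamma = 0.2$, and the two-blob data are all hypothetical choices for illustration.

```python
def make_threshold(a):
    """Hypothetical hypothesis f_a(x) = 1 iff x < a; the threshold is stored for inspection."""
    def f(x):
        return 1 if x < a else 0
    f.a = a
    return f

def most_compatible_consistent(hypotheses, labeled, unlabeled, chi):
    """Return the consistent hypothesis with the lowest empirical unlabeled error, or None."""
    consistent = [f for f in hypotheses if all(f(x) == y for x, y in labeled)]
    if not consistent:
        return None

    def unlabeled_error(f):
        return 1.0 - sum(chi(f, x) for x in unlabeled) / len(unlabeled)

    return min(consistent, key=unlabeled_error)

# Margin-style compatibility (an assumption): x is compatible if it is far from the threshold.
def chi(f, x, gamma=0.2):
    return 1 if abs(x - f.a) > gamma else 0

hypotheses = [make_threshold(a) for a in (0.15, 0.55, 0.95)]
unlabeled = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]   # two well-separated blobs on the line
labeled = [(0.0, 1), (1.1, 0)]               # every threshold above is consistent with these

best = most_compatible_consistent(hypotheses, labeled, unlabeled, chi)
print(best.a)
```

All three thresholds fit the two labels, so the labeled data alone cannot choose among them; the unlabeled sample breaks the tie in favor of the threshold lying in the gap between the blobs.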

Finally, before going to infinite hypothesis spaces, we give a simple Occam-style version of the above
bounds for this setting. Given a sample $S$, let us define $desc_S(f) = \ln|C_{S,\chi}(\widehat{err}_{unl}(f))|$. That is,
$desc_S(f)$ is the description length of $f$ (in “nats”) if we sort hypotheses by their empirical compatibility
and output the index of $f$ in this ordering. Similarly, define $\epsilon\text{-}desc_D(f) = \ln|C_{D,\chi}(err_{unl}(f) + \epsilon)|$. This
is an upper bound on the description length of $f$ if we sort hypotheses by an $\epsilon$-approximation to their
true compatibility. Then we immediately get a bound as follows:

Corollary 2.3.3 For any set $S$ of unlabeled data, given $m_l$ labeled examples, with probability at least
$1 - \delta$, all $f \in C$ satisfying $\widehat{err}(f) = 0$ and $desc_S(f) \le \epsilon m_l - \ln(1/\delta)$ have $err(f) \le \epsilon$. Furthermore, if
$|S| \ge \frac{2}{\epsilon^2}\left[\ln|C| + \ln\frac{2}{\delta}\right]$, then with probability at least $1 - \delta$, all $f \in C$ satisfy $desc_S(f) \le \epsilon\text{-}desc_D(f)$.

Interpretation: The point of this bound is that an algorithm can use observable quantities (the “empirical
description length” of the hypothesis produced) to determine if it can be confident that its true error rate
is low (i.e., if we can find a hypothesis with $desc_S(f) \le \epsilon m_l - \ln(1/\delta)$ and $\widehat{err}(f) = 0$, we can be
confident that it has error rate at most $\epsilon$). Furthermore, if we have enough unlabeled data, the observable
quantities will be no worse than if we were learning a slightly less compatible function using an
infinite-size unlabeled sample.

Note that if we begin with a non-distribution-dependent ordering of hypotheses, inducing some
description length $desc(f)$, and our compatibility assumptions turn out to be wrong, then it could well be that
$desc_D(c^*) > desc(c^*)$. In this case our use of unlabeled data would end up hurting rather than helping.
However, notice that by merely interleaving the initial ordering and the ordering produced by $S$, we get a
new description length $desc_{new}(f)$ such that

$$desc_{new}(f) \le 1 + \min(desc(f), desc_S(f)).$$

Thus, up to an additive constant, we can get the best of both orderings.
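
The interleaving argument can be checked on a toy example. In the sketch below, description length is taken to be the natural log of a hypothesis's 1-based position in an ordering, which is a simplifying assumption (it ignores ties in compatibility); the hypothesis names and the two orderings are hypothetical. A hypothesis at position $p$ in either ordering lands at position at most $2p$ in the merged ordering, which costs at most $\ln 2 < 1$ extra nats.

```python
import math

def interleave(order_a, order_b):
    """Alternate between two orderings of the same hypotheses, skipping repeats."""
    merged, seen = [], set()
    for a, b in zip(order_a, order_b):
        for h in (a, b):
            if h not in seen:
                seen.add(h)
                merged.append(h)
    return merged

def desc(order, h):
    """Description length in nats: log of the 1-based index of h in the ordering."""
    return math.log(order.index(h) + 1)

# Hypothetical orderings: a prior ordering and one induced by empirical compatibility.
prior_order = ["f1", "f2", "f3", "f4", "f5", "f6"]
compat_order = ["f6", "f5", "f4", "f3", "f2", "f1"]

merged = interleave(prior_order, compat_order)
for h in prior_order:
    best_single = min(desc(prior_order, h), desc(compat_order, h))
    print(h, round(desc(merged, h), 3), round(best_single, 3))
```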

Also, if we have the ability to purchase additional labeled examples until the function produced is
sufficiently “short” compared to the amount of data, then we can perform the usual stratification and be
confident whenever we find a consistent function $f$ such that $desc_S(f) \le \epsilon m_l - \ln\frac{m_l(m_l+1)}{\delta}$, where $m_l$
is the number of labeled examples seen so far.

Efficient  algorithms  in  our  model  Finally,  we  end  this  section  with  a  definition  describing  our  goals  for 
efficient  learning  algorithms,  based  on  the  above  sample  bounds. 

Definition 2.3.1 Given a class $C$ and compatibility notion $\chi$, we say that an algorithm efficiently
$PAC_{unl}$-learns the pair $(C, \chi)$ if, for any distribution $D$, for any target function $c^* \in C$ with $err_{unl}(c^*) = 0$,
for any given $\epsilon > 0$, $\delta > 0$, with probability at least $1 - \delta$ it achieves error at most $\epsilon$ using
$poly(\log|C|, 1/\epsilon, 1/\delta)$ unlabeled examples and $poly(\log|C_{D,\chi}(\epsilon)|, 1/\epsilon, 1/\delta)$ labeled examples, and with
time which is $poly(\log|C|, 1/\epsilon, 1/\delta)$.

We say that an algorithm semi-agnostically $PAC_{unl}$-learns $(C, \chi)$ if it is able to achieve this guarantee
for any $c^* \in C$ even if $err_{unl}(c^*) \neq 0$, using labeled examples $poly(\log|C_{D,\chi}(err_{unl}(c^*) + \epsilon)|, 1/\epsilon, 1/\delta)$.

Infinite  hypothesis  spaces 

To reduce notation, we will assume in the rest of this chapter that $\chi(f, x) \in \{0, 1\}$ so that $\chi(f, D) =
\Pr_{x \sim D}[\chi(f, x) = 1]$. However, all our sample complexity results can be easily extended to the general
case.

For infinite hypothesis spaces, the first issue that arises is that in order to achieve uniform convergence
of unlabeled error rates, the set whose complexity we care about is not $C$ but rather $\chi(C) = \{\chi_f : f \in C\}$
where $\chi_f : X \to \{0, 1\}$ and $\chi_f(x) = \chi(f, x)$. For instance, suppose examples are just points on the line,
and $C = \{f_a(x) : f_a(x) = 1 \text{ iff } x \le a\}$. In this case, $VCdim(C) = 1$. However, we could imagine
a compatibility function such that $\chi(f_a, x)$ depends on some complicated relationship between the real
numbers $a$ and $x$. In this case, $VCdim(\chi(C))$ is much larger, and indeed we would need many more
unlabeled examples to estimate compatibility over all of $C$.

A second issue is that we need an appropriate measure for the “size” of the set of surviving functions.
VC-dimension tends not to be a good choice: for instance, if we consider the case of Example 1 (margins),
then even if data is concentrated in two well-separated “blobs”, the set of compatible separators still has as
large a VC-dimension as the entire class even though they are all very similar with respect to $D$ (see, e.g.,
Figure 2.1 after Theorem 2.3.5 below). Instead, it is better to consider distribution dependent complexity
measures such as annealed VC-entropy [103] or Rademacher averages [43, 72, 155]. For this we introduce
some notation. Specifically, for any $C$, we denote by $C[m, D]$ the expected number of splits of $m$ points
(drawn i.i.d.) from $D$ using concepts in $C$. Also, for a given (fixed) $S \subseteq X$, we will denote by $\overline{S}$ the
uniform distribution over $S$, and by $C[m, \overline{S}]$ the expected number of splits of $m$ points from $\overline{S}$ using
concepts in $C$. The following is the analog of Theorem 2.3.2 for the infinite case.

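Theorem 2.3.4 If $c^* \in C$ and $err_{unl}(c^*) = t$ then $m_u$ unlabeled examples and $m_l$ labeled examples are
sufficient to learn to error $\epsilon$ with probability $1 - \delta$, for

$$m_u = O\left(\frac{VCdim(\chi(C))}{\epsilon^2}\ln\frac{1}{\epsilon} + \frac{1}{\epsilon^2}\ln\frac{2}{\delta}\right)$$

and

$$m_l = \frac{2}{\epsilon}\left[\ln\left(2 C_{D,\chi}(t + 2\epsilon)[2m_l, D]\right) + \ln\frac{4}{\delta}\right],$$

where recall $C_{D,\chi}(t + 2\epsilon)[2m_l, D]$ is the expected number of splits of $2m_l$ points drawn from $D$ using
concepts in $C$ of unlabeled error rate $\le t + 2\epsilon$. In particular, with probability at least $1 - \delta$, the $f \in C$
that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$ has $err(f) \le \epsilon$.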

Proof: Let $S$ be the set of unlabeled examples. By standard VC-dimension bounds (e.g., see
Theorem A.1.1 in Appendix A.1.1) the number of unlabeled examples given is sufficient to ensure that
with probability at least $1 - \delta/2$ we have $|\Pr_{x \sim \overline{S}}[\chi_f(x) = 1] - \Pr_{x \sim D}[\chi_f(x) = 1]| \le \epsilon$ for all $\chi_f \in \chi(C)$.
Since $\chi_f(x) = \chi(f, x)$, this implies that we have $|\widehat{err}_{unl}(f) - err_{unl}(f)| \le \epsilon$ for all $f \in C$. So, the set
of hypotheses with $\widehat{err}_{unl}(f) \le t + \epsilon$ is contained in $C_{D,\chi}(t + 2\epsilon)$.

The bound on the number of labeled examples now follows directly from known concentration results
using the expected number of partitions instead of the maximum in the standard VC-dimension bounds
(e.g., see Theorem A.1.2 in Appendix A.1.1). This bound ensures that with probability $1 - \delta/2$, none of the
functions $f \in C_{D,\chi}(t + 2\epsilon)$ with $err(f) \ge \epsilon$ have $\widehat{err}(f) = 0$.

The above two arguments together imply that with probability $1 - \delta$, all $f \in C$ with $\widehat{err}(f) = 0$ and
$\widehat{err}_{unl}(f) \le t + \epsilon$ have $err(f) \le \epsilon$, and furthermore $c^*$ has $\widehat{err}_{unl}(c^*) \le t + \epsilon$. This in turn implies that
with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$ has $err(f) \le \epsilon$
as desired. ■


We can also give a bound where we specify the number of labeled examples as a function of the
unlabeled sample; this is useful because we can imagine our learning algorithm performing some calculations
over the unlabeled data and then deciding how many labeled examples to purchase.

Theorem 2.3.5 If $c^* \in C$ and $err_{unl}(c^*) = t$, then an unlabeled sample $S$ of size

$$O\left(\frac{\max[VCdim(C), VCdim(\chi(C))]}{\epsilon^2}\ln\frac{1}{\epsilon} + \frac{1}{\epsilon^2}\ln\frac{2}{\delta}\right)$$

is sufficient so that if we label $m_l$ examples drawn uniformly at random from $S$, where

$$m_l > \frac{4}{\epsilon}\left[\ln\left(2 C_{S,\chi}(t + \epsilon)[2m_l, \overline{S}]\right) + \ln\frac{4}{\delta}\right],$$

then with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}_{unl}(f)$ subject to $\widehat{err}(f) = 0$ has
$err(f) \le \epsilon$.

Proof: Standard VC-bounds (in the same form as for Theorem 2.3.4) imply that the number of labeled
examples $m_l$ is sufficient to guarantee the conclusion of the theorem with “$err(f)$” replaced by “$err_{\overline{S}}(f)$”
(the error with respect to $\overline{S}$) and “$\epsilon$” replaced with “$\epsilon/2$”. The number of unlabeled examples is enough
to ensure that, with probability $\ge 1 - \delta/2$, for all $f \in C$, $|err(f) - err_{\overline{S}}(f)| \le \epsilon/2$. Combining these two
statements yields the theorem. ■

Note that if we assume $err_{unl}(c^*) = 0$, then we can use the set $C_{S,\chi}(0)$ instead of $C_{S,\chi}(t + \epsilon)$ in the
formula giving the number of labeled examples in Theorem 2.3.5.

Note: Notice that for the setting of Example 1, in the worst case (over distributions $D$) this will essentially
recover the standard margin sample-complexity bounds for the number of labeled examples. In particular,
$C_{S,\chi}(0)$ contains only those separators that split $S$ with margin $\ge \gamma$, and therefore, $s = C_{S,\chi}(0)[2m_l, \overline{S}]$
is no greater than the maximum number of ways of splitting $2m_l$ points with margin $\gamma$. However, if the
distribution is helpful, then the bounds can be much better because there may be many fewer ways of
splitting $S$ with margin $\gamma$. For instance, in the case of two well-separated “blobs” illustrated in Figure 2.1,
if $S$ is large enough, we would have just $s = 4$.


Figure 2.1: Linear separators with a margin-based notion of compatibility. If the distribution is uniform
over two well-separated “blobs” and the unlabeled set $S$ is sufficiently large, the set $C_{S,\chi}(0)$ contains only
four different partitions of $S$, shown in the figure as $f_1$, $f_2$, $f_3$, and $f_4$. Therefore, Theorem 2.3.5 implies
that we only need $O(1/\epsilon)$ labeled examples to learn well.

Theorem 2.3.5 immediately implies the following stratified version, which applies to the case in which
one repeatedly draws labeled examples until that number is sufficient to justify the most-compatible
hypothesis found.

Corollary 2.3.6 An unlabeled sample $S$ of size

$$O\left(\frac{\max[VCdim(C), VCdim(\chi(C))]}{\epsilon^2}\ln\frac{1}{\epsilon} + \frac{1}{\epsilon^2}\ln\frac{2}{\delta}\right)$$

is sufficient so that with probability $\ge 1 - \delta$ we have that simultaneously for every $k \ge 0$ the following is
true: if we label $m_k$ examples drawn uniformly at random from $S$, where

$$m_k > \frac{4}{\epsilon}\left[\ln\left(2 C_{S,\chi}((k + 1)\epsilon)[2m_k, \overline{S}]\right) + \ln\frac{4(k+1)(k+2)}{\delta}\right],$$

then all $f \in C$ with $\widehat{err}(f) = 0$ and $\widehat{err}_{unl}(f) \le (k + 1)\epsilon$ have $err(f) \le \epsilon$.

Interpretation: This corollary is an analog of Corollary 2.3.3 and it justifies a stratification based on
the estimated unlabeled error rates. That is, beginning with $k = 0$, one draws the specified number
of examples and checks to see if a sufficiently compatible hypothesis can be found. If so, one halts with
success, and if not, one increments $k$ and tries again. Since $k \le 1/\epsilon$, we clearly have a fallback property: the
number of labeled examples required is never much worse than the number of labeled examples required
by a standard supervised learning algorithm.

If one does not have the ability to draw additional labeled examples, then we can fix $m_l$ and instead
stratify over estimation error as in [45]. We discuss this further in our agnostic bounds in Section 2.3.1
below.

The  agnostic  case 


The bounds given so far have been based on the assumption that the target function belongs to $C$ (so
that we can assume there will exist $f \in C$ with $\widehat{err}(f) = 0$). One can also derive analogous results for
the  agnostic  (unrealizable)  case,  where  we  do  not  make  that  assumption.  We  first  present  one  immediate 
bound  of  this  form,  and  then  show  how  we  can  use  it  in  order  to  trade  off  labeled  and  unlabeled  error 
in  a  near-optimal  way.  We  also  discuss  the  relation  of  this  to  a  common  “regularization”  technique  used 
in  semi-supervised  learning.  As  we  will  see,  the  differences  between  these  two  point  to  certain  potential 
pitfalls  in  the  standard  regularization  approach. 

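Theorem 2.3.7 Let $f_t^* = \mathrm{argmin}_{f \in C}[err(f) \mid err_{unl}(f) \le t]$. Then an unlabeled sample $S$ of size

$$O\left(\frac{\max[VCdim(C), VCdim(\chi(C))]}{\epsilon^2}\log\frac{1}{\epsilon} + \frac{1}{\epsilon^2}\log\frac{2}{\delta}\right)$$

and a labeled sample of size

$$m_l \ge \frac{8}{\epsilon^2}\left[\log\left(2 C_{D,\chi}(t + 2\epsilon)[2m_l, D]\right) + \log\frac{4}{\delta}\right]$$

is sufficient so that with probability $\ge 1 - \delta$, the $f \in C$ that optimizes $\widehat{err}(f)$ subject to $\widehat{err}_{unl}(f) \le t + \epsilon$
has $err(f) \le err(f_t^*) + \epsilon + \sqrt{\log(4/\delta)/(2m_l)} \le err(f_t^*) + 2\epsilon$.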

Proof: The given unlabeled sample size implies that with probability $1 - \delta/2$, all $f \in C$ have
$|\widehat{err}_{unl}(f) - err_{unl}(f)| \le \epsilon$, which also implies that $\widehat{err}_{unl}(f_t^*) \le t + \epsilon$. The labeled sample size,
using standard VC bounds (e.g., Theorem A.1.3 in Appendix A.1.2), implies that with probability at least
$1 - \delta/4$, all $f \in C_{D,\chi}(t + 2\epsilon)$ have $|\widehat{err}(f) - err(f)| \le \epsilon$. Finally, by Hoeffding bounds, with probability
at least $1 - \delta/4$ we have

$$\widehat{err}(f_t^*) \le err(f_t^*) + \sqrt{\log(4/\delta)/(2m_l)}.$$

Therefore, with probability at least $1 - \delta$, the $f \in C$ that optimizes $\widehat{err}(f)$ subject to $\widehat{err}_{unl}(f) \le t + \epsilon$
has

$$err(f) \le \widehat{err}(f) + \epsilon \le \widehat{err}(f_t^*) + \epsilon \le err(f_t^*) + \epsilon + \sqrt{\log(4/\delta)/(2m_l)} \le err(f_t^*) + 2\epsilon,$$

as desired. ■

Interpretation: Given a value $t$, Theorem 2.3.7 bounds the number of labeled examples needed to achieve
error at most $\epsilon$ larger than that of the best function $f_t^*$ of unlabeled error rate at most $t$. Alternatively, one
can also state Theorem 2.3.7 in the form more commonly used in statistical learning theory: given any
number of labeled examples $m_l$ and given $t > 0$, Theorem 2.3.7 implies that with high probability, the
function $f$ that optimizes $\widehat{err}(f)$ subject to $\widehat{err}_{unl}(f) \le t + \epsilon$ satisfies

$$err(f) \le \widehat{err}(f) + \epsilon_t \le err(f_t^*) + \epsilon_t + \sqrt{\frac{\log(4/\delta)}{2m_l}},$$

where

$$\epsilon_t = \sqrt{\frac{8}{m_l}\log\left(8 C_{D,\chi}(t + 2\epsilon)[2m_l, D]/\delta\right)}.$$

Note that as usual, there is an inherent tradeoff here between the quality of the comparison function $f_t^*$,
which improves as $t$ increases, and the estimation error $\epsilon_t$, which gets worse as $t$ increases. Ideally, one
would like to achieve a bound of $\min_t[err(f_t^*) + \epsilon_t] + \sqrt{\log(4/\delta)/(2m_l)}$; i.e., as if the optimal value of
$t$ were known in advance. We can perform nearly as well as this bound by (1) performing a stratification
over $t$ (so that the bound holds simultaneously for all values of $t$) and (2) using an estimate $\hat{\epsilon}_t$ of $\epsilon_t$ that
we can calculate from the unlabeled sample and therefore use in the optimization. In particular, letting
$f_t = \mathrm{argmin}_{f' \in C}[\widehat{err}(f') : \widehat{err}_{unl}(f') \le t]$, we will output $f = \mathrm{argmin}_{f_t}[\widehat{err}(f_t) + \hat{\epsilon}_t]$.

Specifically, given a set $S$ of unlabeled examples and $m_l$ labeled examples, let

$$\hat{\epsilon}_t = \hat{\epsilon}_t(S, m_l) = \sqrt{\frac{24}{m_l}\log\left(8 C_{S,\chi}(t)[m_l, S]\right)},$$

where we define $C_{S,\chi}(t)[m_l, S]$ to be the number of different partitions of the first $m_l$ points in $S$ using
functions in $C_{S,\chi}(t)$, i.e., using functions of empirical unlabeled error at most $t$ (we assume $|S| \ge m_l$).
Then we have the following theorem.

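Theorem 2.3.8 Let $f_t^* = \mathrm{argmin}_{f' \in C}[err(f') \mid err_{unl}(f') \le t]$ and define $\hat{\epsilon}(f') = \hat{\epsilon}_{t'}$ for $t' = \widehat{err}_{unl}(f')$.
Then, given $m_l$ labeled examples, with probability at least $1 - \delta$, the function

$$f = \mathrm{argmin}_{f'}[\widehat{err}(f') + \hat{\epsilon}(f')]$$

satisfies the guarantee that

$$err(f) \le \min_t\left[err(f_t^*) + \hat{\epsilon}(f_t^*)\right] + 5\sqrt{\frac{\log(8/\delta)}{m_l}}.$$

Proof: First we argue that with probability at least $1 - \delta/2$, for all $f' \in C$ we have

$$err(f') \le \widehat{err}(f') + \hat{\epsilon}(f') + 4\sqrt{\frac{\log(8/\delta)}{m_l}}.$$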

In particular, define $C_0 = C_{S,\chi}(0)$ and inductively for $k > 0$ define $C_k = C_{S,\chi}(t_k)$ for $t_k$ such that
$C_k[m_l, S] = 8 C_{k-1}[m_l, S]$. (If necessary, arbitrarily order the functions with empirical unlabeled error
exactly $t_k$ and choose a prefix such that the size condition holds.) Also, we may assume without loss of
generality that $C_0[m_l, S] \ge 1$. Then, using bounds of [71] (see also Appendix A), we have that with
probability at least $1 - \delta/2^{k+2}$, all $f' \in C_k \setminus C_{k-1}$ satisfy:

$$\begin{aligned}
err(f') &\le \widehat{err}(f') + \sqrt{\tfrac{6}{m_l}\log(C_k[m_l, S])} + 4\sqrt{\tfrac{1}{m_l}\log(2^{k+3}/\delta)} \\
&\le \widehat{err}(f') + \sqrt{\tfrac{6}{m_l}\log(C_k[m_l, S])} + 4\sqrt{\tfrac{1}{m_l}\log(2^{k})} + 4\sqrt{\tfrac{1}{m_l}\log(8/\delta)} \\
&\le \widehat{err}(f') + \sqrt{\tfrac{6}{m_l}\log(C_k[m_l, S])} + \sqrt{\tfrac{6}{m_l}\log(8^{k})} + 4\sqrt{\tfrac{1}{m_l}\log(8/\delta)} \\
&\le \widehat{err}(f') + 2\sqrt{\tfrac{6}{m_l}\log(C_k[m_l, S])} + 4\sqrt{\tfrac{1}{m_l}\log(8/\delta)} \\
&\le \widehat{err}(f') + \hat{\epsilon}(f') + 4\sqrt{\tfrac{1}{m_l}\log(8/\delta)}.
\end{aligned}$$

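Now, let $f^* = \arg\min_{f_t^*}[err(f_t^*) + \hat{\epsilon}(f_t^*)]$. By Hoeffding bounds, with probability at least $1 - \delta/2$ we have
$\widehat{err}(f^*) \le err(f^*) + \sqrt{\log(2/\delta)/(2m_l)}$. Also, by construction we have $\widehat{err}(f) + \hat{\epsilon}(f) \le \widehat{err}(f^*) + \hat{\epsilon}(f^*)$.
Therefore with probability at least $1 - \delta$ we have:

$$\begin{aligned}
err(f) &\le \widehat{err}(f) + \hat{\epsilon}(f) + 4\sqrt{\log(8/\delta)/m_l} \\
&\le \widehat{err}(f^*) + \hat{\epsilon}(f^*) + 4\sqrt{\log(8/\delta)/m_l} \\
&\le err(f^*) + \hat{\epsilon}(f^*) + 5\sqrt{\log(8/\delta)/m_l}
\end{aligned}$$

as desired. ■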

The above result bounds the error of the function $f$ produced in terms of the quantity $\hat{\epsilon}(f^*)$, which
depends on the empirical unlabeled error rate of $f^*$. If our unlabeled sample $S$ is sufficiently large to estimate
all unlabeled error rates to $\pm\epsilon$, then with high probability we have $\widehat{err}_{unl}(f_t^*) \le t + \epsilon$, so $\hat{\epsilon}(f_t^*) \le \hat{\epsilon}_{t+\epsilon}$, and
moreover $C_{S,\chi}(t+\epsilon) \subseteq C_{D,\chi}(t+2\epsilon)$. So, our error term $\hat{\epsilon}(f_t^*)$ is at most $\sqrt{\frac{24}{m_l}\log\left(8\,C_{D,\chi}(t+2\epsilon)[m_l, S]\right)}$.
Recall that our ideal error term $\epsilon_t$ for the case that $t$ was given to the algorithm in advance, factoring out the
dependence on $\delta$, was $\sqrt{\frac{8}{m_l}\log\left(8\,C_{D,\chi}(t+2\epsilon)[2m_l, D]\right)}$. [71] show that for any class $C$, the quantity
$\log(C[m, S])$ is tightly concentrated about $\log(C[m, D])$ (see also Theorem A.1.6 in Appendix A.1.2),
so up to multiplicative constants, these two bounds are quite close.

Interpretation and use of unlabeled error rate as a regularizer: The above theorem suggests optimizing
the sum of the empirical labeled error rate and an estimation-error bound based on the unlabeled
error rate. A common related approach used in practice in machine learning (e.g., [82]) is to just
directly optimize the sum of the two kinds of error: i.e., to find $\arg\min_f[\widehat{err}(f) + \widehat{err}_{unl}(f)]$. However,
this is not generically justified in our framework, because the labeled and unlabeled error rates are really
of different “types”. In particular, depending on the concept class and notion of compatibility, a small
change in unlabeled error rate could substantially change the size of the compatible set.^ For example,
suppose all functions in $C$ have unlabeled error rate 0.6, except for two: function $f_0$ has unlabeled
error rate 0 and labeled error rate 1/2, and function $f_{0.5}$ has unlabeled error rate 0.5 and labeled error
rate 1/10. Suppose also that $C$ is sufficiently large that with high probability it contains some
functions $f$ that drastically overfit, giving $\widehat{err}(f) = 0$ even though their true error is close to 1/2. In this
case, we would like our algorithm to pick out $f_{0.5}$ (since its labeled error rate is fairly low, and we
cannot trust the functions of unlabeled error 0.6). However, even if we use a regularization parameter
$\lambda$, there is no way to make $f_{0.5} = \arg\min_f[\widehat{err}(f) + \lambda\,\widehat{err}_{unl}(f)]$: in particular, one cannot have
$1/10 + 0.5\lambda \le \min[1/2 + 0\lambda,\ 0 + 0.6\lambda]$, since the first comparison requires $\lambda \le 0.8$ and the second requires
$\lambda \ge 1$. So, in this case, this approach will not have the desired behavior.

Note: One could further derive tighter bounds, both in terms of labeled and unlabeled examples, that are
based on other distribution-dependent complexity measures and using stronger concentration results (see
e.g. [72]).
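
The regularizer example can be checked numerically. The sketch below is only an illustration: it represents the three kinds of functions in the example by (labeled error, unlabeled error) pairs and scans a grid of regularization weights. The names `objective` and `winner` and the grid range are assumptions made for this sketch.

```python
# (empirical labeled error, unlabeled error) for the three kinds of functions in the example
CANDIDATES = {
    "f0": (0.5, 0.0),       # unlabeled error 0, labeled error 1/2
    "f05": (0.1, 0.5),      # unlabeled error 0.5, labeled error 1/10
    "overfit": (0.0, 0.6),  # overfitting function: empirical labeled error 0
}

def objective(name, lam):
    """Labeled error plus lam times unlabeled error."""
    labeled, unlabeled = CANDIDATES[name]
    return labeled + lam * unlabeled

def winner(lam):
    """Name of the candidate minimizing the regularized objective for this lam."""
    return min(CANDIDATES, key=lambda name: objective(name, lam))

# Scan lam over [0, 10] in steps of 0.01 (an arbitrary illustrative grid).
winners = {winner(i / 100) for i in range(0, 1001)}
print(sorted(winners))
```

On this grid the selected function is always the overfitting one (small weights) or $f_0$ (large weights), never $f_{0.5}$, matching the incompatible conditions $\lambda < 0.8$ and $\lambda > 1$.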

2.3.2  $\epsilon$-Cover-based Bounds

The results in the previous section are uniform convergence bounds: they provide guarantees for any
algorithm that optimizes over the observed data. In this section, we consider stronger bounds based on
$\epsilon$-covers that apply to algorithms that behave in a specific way: they first use the unlabeled examples to
choose a “representative” set of compatible hypotheses, and then use the labeled sample to choose among
these. Bounds based on $\epsilon$-covers exist in the classical PAC setting, but in our framework these bounds
and algorithms of this type are especially natural, and the bounds are often much lower than what can be
achieved via uniform convergence. For simplicity, we restrict ourselves in this section to the realizable
case. However, one can combine ideas in Section 2.3.1 with ideas in this section in order to derive bounds
in the agnostic case as well. We first present our generic bounds. In Section 2.3.2 we discuss natural
settings in which they can be especially useful, and then in Section 2.3.2 we present even tighter bounds
for co-training.

^On the other hand, for certain compatibility notions and under certain natural assumptions, one can use unlabeled error rate
directly; see, e.g., [196].

Recall that a set $C_\epsilon \subseteq 2^X$ is an $\epsilon$-cover for $C$ with respect to $D$ if for every $f \in C$ there is an $f' \in C_\epsilon$
which is $\epsilon$-close to $f$. That is, $\Pr_{x \sim D}(f(x) \ne f'(x)) \le \epsilon$.

We start with a theorem that relies on knowing a good upper bound on the unlabeled error rate of the
target function $err_{unl}(c^*)$.

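Theorem 2.3.9 Assume $c^* \in C$ and let $p$ be the size of a minimum $\epsilon$-cover for $C_{D,\chi}(err_{unl}(c^*) + 2\epsilon)$.
Then using $m_u$ unlabeled examples and $m_l$ labeled examples for

$$m_u = O\left(\frac{\max[VCdim(C), VCdim(\chi(C))]}{\epsilon^2}\log\frac{1}{\epsilon} + \frac{1}{\epsilon^2}\log\frac{2}{\delta}\right) \quad \text{and} \quad m_l = O\left(\frac{1}{\epsilon}\ln\frac{p}{\delta}\right),$$

we can with probability $1 - \delta$ identify a hypothesis $f \in C$ with $err(f) \le 6\epsilon$.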

Proof: Let $t = err_{unl}(c^*)$. Now, given the unlabeled sample $S_U$, define $C' \subseteq C$ as follows: for
every labeling of $S_U$ that is consistent with some $f$ in $C$, choose a hypothesis in $C$ for which $\widehat{err}_{unl}(f)$ is
smallest among all the hypotheses corresponding to that labeling. Next, we obtain $C_\epsilon$ by eliminating from
$C'$ those hypotheses $f$ with the property that $\widehat{err}_{unl}(f) > t + \epsilon$. We then apply a greedy procedure on $C_\epsilon$
to obtain $G_\epsilon = \{g_1, \ldots, g_s\}$, as follows:

Initialize $C_\epsilon^1 = C_\epsilon$ and $i = 1$.

1.  Let $g_i = \arg\min_{f \in C_\epsilon^i} \widehat{err}_{unl}(f)$.

2.  Using the unlabeled sample $S_U$, determine $C_\epsilon^{i+1}$ by deleting from $C_\epsilon^i$ those hypotheses $f$ with the
property that $\hat{d}(g_i, f) < 3\epsilon$.

3.  If $C_\epsilon^{i+1} = \emptyset$ then set $s = i$ and stop; else, increase $i$ by 1 and goto 1.
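
The greedy procedure above can be sketched in code. This is a minimal illustration under stated assumptions: each hypothesis in $C_\epsilon$ is represented by its tuple of labels on $S_U$, the empirical distance $\hat{d}$ is the fraction of sample points on which two label tuples disagree, and the empirical unlabeled error rates are supplied in a dictionary. The function names are hypothetical.

```python
def empirical_distance(f, g):
    """Fraction of unlabeled sample points on which label tuples f and g disagree."""
    return sum(a != b for a, b in zip(f, g)) / len(f)

def greedy_cover(hypotheses, unl_err, eps):
    """Greedy construction of G_eps from C_eps.

    hypotheses: list of label tuples over S_U (the set C_eps)
    unl_err:    dict mapping each hypothesis to its empirical unlabeled error
    eps:        the cover parameter epsilon (> 0)
    """
    remaining = list(hypotheses)
    cover = []
    while remaining:                                      # step 3: stop when nothing is left
        g = min(remaining, key=lambda f: unl_err[f])      # step 1: lowest unlabeled error
        cover.append(g)
        # step 2: delete everything within empirical distance 3*eps of g (including g itself)
        remaining = [f for f in remaining if empirical_distance(g, f) >= 3 * eps]
    return cover

# Toy usage: four hypotheses on a sample of 10 points.
h_neg = (0,) * 10
h_near_neg = (1,) + (0,) * 9
h_pos = (1,) * 10
h_near_pos = (1,) * 9 + (0,)
errs = {h_neg: 0.0, h_near_neg: 0.05, h_pos: 0.01, h_near_pos: 0.04}
cover = greedy_cover([h_neg, h_near_neg, h_pos, h_near_pos], errs, eps=0.1)
```

In the toy usage the cover keeps one representative per cluster, here `h_neg` and `h_pos`.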

We now show that with high probability, $G_\epsilon$ is a $5\epsilon$-cover of $C_{D,\chi}(t)$ with respect to $D$ and has size
at most $p$. First, our bound on $m_u$ is sufficient to ensure that with probability $\ge 1 - \delta/2$ we have (a)
$|\hat{d}(f, g) - d(f, g)| \le \epsilon$ for all $f, g \in C$ and (b) $|\widehat{err}_{unl}(f) - err_{unl}(f)| \le \epsilon$ for all $f \in C$. Let us assume
in the remainder that (a) and (b) are indeed satisfied. Now, (a) implies that any two functions in $C$ that
agree on $S_U$ have distance at most $\epsilon$, and therefore $C'$ is an $\epsilon$-cover of $C$. Using (b), this in turn implies
that $C_\epsilon$ is an $\epsilon$-cover for $C_{D,\chi}(t)$. By construction, $G_\epsilon$ is a $3\epsilon$-cover of $C_\epsilon$ with respect to distribution
$S_U$, and thus (using (a)) $G_\epsilon$ is a $4\epsilon$-cover of $C_\epsilon$ with respect to $D$, which implies that $G_\epsilon$ is a $5\epsilon$-cover of
$C_{D,\chi}(t)$ with respect to $D$.

We now argue that $G_\epsilon$ has size at most $p$. Fix some optimal $\epsilon$-cover $\{f_1, \ldots, f_p\}$ of $C_{D,\chi}(err_{unl}(c^*) + 2\epsilon)$.
Consider function $g_i$ and suppose that $g_i$ is covered by $f_{\sigma(i)}$. Then the set of functions deleted in
step (2) of the procedure includes those functions $f$ satisfying $d(g_i, f) < 2\epsilon$, which by the triangle inequality
includes those satisfying $d(f_{\sigma(i)}, f) \le \epsilon$. Therefore, the set of functions deleted includes those covered by
$f_{\sigma(i)}$ and so for all $j > i$, $\sigma(j) \ne \sigma(i)$; in particular, $\sigma$ is 1-1. This implies that $G_\epsilon$ has size at most $p$.

Finally, to learn $c^*$ we simply output the function $f \in G_\epsilon$ of lowest empirical error over the labeled
sample. By Chernoff bounds, the number of labeled examples is enough to ensure that with probability
$\ge 1 - \delta/2$ the empirical optimum hypothesis in $G_\epsilon$ has true error at most $6\epsilon$. This implies that overall, with
probability $\ge 1 - \delta$, we find a hypothesis of error at most $6\epsilon$. ■

Note that Theorem 2.3.9 relies on knowing a good upper bound on $err_{unl}(c^*)$. If we do not have
such an upper bound, then one can perform a stratification as in Sections 2.3.1 and 2.3.1. For example,
if we have a desired maximum error rate $\epsilon$ and we do not know a good upper bound for $err_{unl}(c^*)$ but
we have the ability to draw additional labeled examples as needed, then we can simply run the procedure
in Theorem 2.3.9 for various values of $p$, testing on each round to see if the hypothesis $f$ found indeed
has zero empirical labeled error rate. One can show that $m_l = O\left(\frac{1}{\epsilon}\ln\frac{p}{\delta}\right)$ labeled examples are sufficient
in total for all the “validation” steps.^ If the number of labeled examples $m_l$ is fixed, then one can also
perform a stratification over the target error $\epsilon$.
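
One way to share the confidence budget across validation rounds, described in the footnote attached to this paragraph, is to give round $i$ the failure probability $\delta/(i(i+1))$. The short sketch below checks, under that assumption, that the per-round budgets telescope to less than $\delta$; exact rational arithmetic is used only so the identity can be verified without rounding, and the function name is hypothetical.

```python
from fractions import Fraction

def run_confidence(delta, i):
    """Failure probability allotted to validation round i (1-indexed): delta / (i (i + 1))."""
    return delta / (i * (i + 1))

delta = Fraction(1, 20)
rounds = 1000
total = sum(run_confidence(delta, i) for i in range(1, rounds + 1))

# Telescoping sum: sum_{i=1}^{N} 1/(i(i+1)) = 1 - 1/(N+1), so the total stays below delta.
print(total, total < delta)
```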


Some  illustrative  examples 

To illustrate the power of $\epsilon$-cover bounds, we now present two examples where these bounds allow for
learning from significantly fewer labeled examples than is possible using uniform convergence.

Graph-based learning: Consider the setting of graph-based algorithms (e.g., Example 3). In particular,
the input is a graph $g$ where each node is an example and $C$ is the class of all boolean functions over the
nodes of $g$. Let us define the incompatibility of a hypothesis to be the fraction of edges in $g$ cut by it.
Suppose now that the graph $g$ consists of two cliques of $n/2$ vertices, connected together by $\epsilon n^2/4$ edges.
Suppose the target function $c^*$ labels one of the cliques as positive and one as negative, so the target
function indeed has unlabeled error rate less than $\epsilon$. Now, given any set $S_L$ of $m_l < \epsilon n/4$ labeled examples,
there is always a highly-compatible hypothesis consistent with $S_L$ that just separates the positive points
in $S_L$ from the entire rest of the graph: the number of edges cut will be at most $n m_l < \epsilon n^2/4$. However,
such a hypothesis has true error nearly 1/2 since it has less than $\epsilon n/4$ positive examples. So, we do not
yet have uniform convergence over the space of highly compatible hypotheses, since this hypothesis has
zero empirical error but high true error. Indeed, this illustrates an overfitting problem that can occur with
a direct minimum-cut approach to learning [59, 67, 140]. On the other hand, the set of functions of
unlabeled error rate less than $\epsilon$ has a small $\epsilon$-cover: in particular, any partition of $g$ that cuts less than $\epsilon n^2/4$
edges must be $\epsilon$-close to (a) the all-positive function, (b) the all-negative function, (c) the target function
$c^*$, or (d) the complement of the target function $1 - c^*$. So, $\epsilon$-cover bounds act as if the concept class had
only 4 functions and so by Theorem 2.3.9 we need only $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$ labeled examples to learn well.^ (In
fact, since the functions in the cover are all far from each other, we really need only $O(\log_{1/\epsilon}\frac{1}{\delta})$ examples.
This issue is explored further in Theorem 2.3.11.)
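
The two-clique example can be made concrete on a small instance. The sketch below is an illustration only: the parameters $n = 80$, $\epsilon = 0.2$, $m_l = 3$, the deterministic choice of cross edges, and the function names are assumptions of this sketch. It compares the target labeling with the overfitting hypothesis that marks only three labeled positive points as positive.

```python
from itertools import combinations

def two_clique_graph(n, n_cross):
    """Two cliques on nodes 0..n/2-1 and n/2..n-1, joined by n_cross cross edges."""
    half = n // 2
    clique_a, clique_b = range(half), range(half, n)
    edges = set(combinations(clique_a, 2)) | set(combinations(clique_b, 2))
    cross = [(a, b) for a in clique_a for b in clique_b]
    edges.update(cross[:n_cross])  # arbitrary deterministic choice of cross edges
    return edges

def cut_size(edges, positives):
    """Number of edges whose endpoints receive different labels."""
    return sum((u in positives) != (v in positives) for u, v in edges)

n, eps = 80, 0.2
n_cross = round(eps * n * n / 4)      # eps * n^2 / 4 cross edges
m_l = 3                               # fewer than eps * n / 4 = 4 labeled points
edges = two_clique_graph(n, n_cross)

target = set(range(n // 2))           # c*: first clique positive, second negative
overfit = {0, 1, 2}                   # only the labeled positive points are positive

target_cut = cut_size(edges, target)
overfit_cut = cut_size(edges, overfit)
overfit_error = len(target ^ overfit) / n   # disagreement with c*, uniform over nodes
print(target_cut, overfit_cut, overfit_error)
```

The overfitting hypothesis cuts fewer edges than the target, so it looks at least as compatible, yet it disagrees with the target on almost half of the nodes.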

Simple co-training: For another case where $\epsilon$-cover bounds can beat uniform-convergence bounds,
imagine examples are pairs of points in $\{0, 1\}^d$, $C$ is the class of linear separators, and compatibility is
determined by whether both points are on the same side of the separator (i.e., the case of Example 4). Now
suppose for simplicity that the target function just splits the hypercube on the first coordinate, and the
distribution is uniform over pairs having the same first coordinate (so the target is fully compatible). We
then have the following.

Theorem 2.3.10 Given $poly(d)$ unlabeled examples $S_U$ and $\frac{1}{4}\log d$ labeled examples $S_L$, with high
probability there will exist functions of true error $1/2 - 2^{-\frac{1}{2}\sqrt{d}}$ that are consistent with $S_L$ and compatible
with $S_U$.

Proof: Let $V$ be the set of all variables (not including $x_1$) that (a) appear in every positive example
of $S_L$ and (b) appear in no negative example of $S_L$. In other words, these are variables $x_i$ such that
the function $f(x) = x_i$ correctly classifies all examples in $S_L$. Over the draw of $S_L$, each variable has a
$(1/2)^{2|S_L|} = 1/\sqrt{d}$ chance of belonging to $V$, so the expected size of $V$ is $(d-1)/\sqrt{d}$ and so by Chernoff
bounds, with high probability $V$ has size at least $\frac{1}{2}\sqrt{d}$. Now, consider the hypothesis corresponding to
the conjunction of all variables in $V$. This correctly classifies the examples in $S_L$, and with probability
at least $1 - 2|S_U|2^{-|V|}$ it classifies every other example in $S_U$ negative because each example in $S_U$ has
only a $1/2^{|V|}$ chance of satisfying every variable in $V$. Since $|S_U| = poly(d)$, this means that with high
probability this conjunction is compatible with $S_U$ and consistent with $S_L$, even though its true error is at
least $1/2 - 2^{-\frac{1}{2}\sqrt{d}}$. ■

^Specifically, note that as we increase $t$ (our current estimate for the unlabeled error rate of the target function), the associated
$p$ (which is an integer) increases in discrete jumps, $p_1, p_2, \ldots$. We can then simply spread the “$\delta$” parameter across the different
runs, in particular run $i$ would use $\delta/i(i+1)$. Since $p_i \ge i$, this implies that $m_l = O\left(\frac{1}{\epsilon}\ln\frac{p}{\delta}\right)$ labeled examples are sufficient
for all the “validation” steps.

^Effectively, $\epsilon$-cover bounds allow one to rule out a hypothesis that, say, just separates the positive points in $S_L$ from the rest
of the graph by noting that this hypothesis is very close (with respect to $D$) to the all-negative hypothesis, and that hypothesis
has a high labeled-error rate.

So, given only a set $S_U$ of $poly(d)$ unlabeled examples and a set $S_L$ of $\frac{1}{4}\log d$ labeled examples
we would not want to use a uniform convergence based algorithm since we do not yet have uniform
convergence. In contrast, the cover-size of the set of functions compatible with $S_U$ is constant, so $\epsilon$-cover
based bounds again allow learning from only $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$ labeled examples (Theorem 2.3.9). In fact,
as we show in Theorem 2.3.11, we only need $O(\log_{1/\epsilon}\frac{1}{\delta})$ labeled examples in this case.
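
The counting argument behind Theorem 2.3.10 can be illustrated with a small simulation. The sketch below is a rough illustration under stated assumptions: $d = 4096$ (so $\frac{1}{4}\log_2 d = 3$ labeled pairs), a fixed random seed, 0-based indexing in which index 0 plays the role of the first coordinate, and hypothetical helper names. It draws labeled pairs from the distribution described above and collects the set $V$ of variables whose single-variable hypothesis agrees with every labeled point.

```python
import random

def draw_pair(d, rng):
    """One example: two uniform points of {0,1}^d sharing the same first coordinate.

    The shared first coordinate is also the label given by the target function.
    """
    first = rng.getrandbits(1)

    def point():
        return [first] + [rng.getrandbits(1) for _ in range(d - 1)]

    return point(), point(), first

def consistent_variables(labeled, d):
    """Indices i >= 1 such that f(x) = x_i agrees with the label on every labeled point."""
    return [i for i in range(1, d)
            if all(p[i] == y and q[i] == y for p, q, y in labeled)]

rng = random.Random(0)
d = 4096                                            # (1/4) * log2(d) = 3 labeled examples
labeled = [draw_pair(d, rng) for _ in range(3)]
V = consistent_variables(labeled, d)

# Each variable survives with probability (1/2)^(2*3) = 1/64 = 1/sqrt(d),
# so |V| should be close to (d - 1)/sqrt(d), i.e. roughly 64.
print(len(V))
```

A conjunction over that many variables is almost never satisfied by a fresh uniform point, which is why it looks compatible with the unlabeled sample while having true error close to 1/2.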


Learning  from  even  fewer  labeled  examples 

In some cases, unlabeled data can allow us to learn from even fewer labeled examples than given by
Theorem 2.3.9. In particular, consider a co-training setting where the target $c^*$ is fully compatible and $D$
satisfies the property that the two views $x_1$ and $x_2$ are conditionally independent given the label $c^*(\langle x_1, x_2\rangle)$.
As shown by [62], one can boost any weak hypothesis from unlabeled data in this setting (assuming one
has enough labeled data to produce a weak hypothesis). Related sample complexity results are given
in [97]. In fact, we can use the notion of $\epsilon$-covers to show that we can learn from just a single labeled
example. Specifically, for any concept classes $C_1$ and $C_2$, we have:

Theorem 2.3.11 Assume that $err(c^*) = err_{unl}(c^*) = 0$ and $D$ satisfies independence given the label.
Then for any $\tau \le \epsilon/4$, using $m_u$ unlabeled examples and $m_l$ labeled examples we can find a hypothesis
that with probability $1 - \delta$ has error at most $\epsilon$, for

$$m_u = O\left(\frac{1}{\tau}\left[(VCdim(C_1) + VCdim(C_2))\ln\frac{1}{\tau} + \ln\frac{2}{\delta}\right]\right) \quad \text{and} \quad m_l = O\left(\log_{1/\tau}\frac{1}{\delta}\right).$$

Proof: We will assume for simplicity the setting of Example 3, where $c^* = c_1^* = c_2^*$ and also $D_1 =
D_2 = D$ (the general case is handled similarly, but just requires more notation).

We start by characterizing the hypotheses with low unlabeled error rate. Recall that $\chi(f, D) =
\Pr_{\langle x_1, x_2\rangle \sim D}[f(x_1) = f(x_2)]$, and for concreteness assume $f$ predicts using $x_1$ if $f(x_1) \ne f(x_2)$.
Consider $f \in C$ with $err_{unl}(f) \le \tau$ and let us define $p_- = \Pr_{x \in D}[c^*(x) = 0]$, $p_+ = \Pr_{x \in D}[c^*(x) = 1]$ and
for $i, j \in \{0, 1\}$ define $p_{ij} = \Pr_{x \in D}[f(x) = i, c^*(x) = j]$. We clearly have $err(f) = p_{10} + p_{01}$. From
$err_{unl}(f) = \Pr_{\langle x_1, x_2\rangle \sim D}[f(x_1) \ne f(x_2)] \le \tau$, using the independence given the label of $D$, we get

$$\frac{2p_{10}p_{00}}{p_{10} + p_{00}} + \frac{2p_{01}p_{11}}{p_{01} + p_{11}} \le \tau.$$

In particular, the fact that $\frac{2p_{10}p_{00}}{p_{10} + p_{00}} \le \tau$ implies that we cannot have both $p_{10} > \tau$ and $p_{00} > \tau$, and the
fact that $\frac{2p_{01}p_{11}}{p_{01} + p_{11}} \le \tau$ implies that we cannot have both $p_{01} > \tau$ and $p_{11} > \tau$. Therefore, any hypothesis
$f$ with $err_{unl}(f) \le \tau$ falls in one of the following categories:

1.  $f$ is “close to $c^*$”: $p_{10} \le \tau$ and $p_{01} \le \tau$; so $err(f) \le 2\tau$.

2.  $f$ is “close to $\bar{c}^*$”: $p_{00} \le \tau$ and $p_{11} \le \tau$; so $err(f) \ge 1 - 2\tau$.

3.  $f$ “almost always predicts negative”: $p_{10} \le \tau$ and $p_{11} \le \tau$; so $\Pr[f(x) = 0] \ge 1 - 2\tau$.

4.  $f$ “almost always predicts positive”: $p_{00} \le \tau$ and $p_{01} \le \tau$; so $\Pr[f(x) = 0] \le 2\tau$.

Let $f_1$ be the constant positive function and $f_0$ be the constant negative function. Now note that our
bound on $m_u$ is sufficient to ensure that with probability $\ge 1 - \delta/2$ we have (a) $|\hat{d}(f, g) - d(f, g)| \le \tau$
for all $f, g \in C$ and (b) all $f \in C$ with $\widehat{err}_{unl}(f) = 0$ satisfy $err_{unl}(f) \le \tau$. Let us assume in the
remainder that (a) and (b) are indeed satisfied. By our previous analysis, there are at most four kinds
of hypotheses consistent with unlabeled data: those close to $c^*$, those close to its complement $\bar{c}^*$, those
close to $f_0$, and those close to $f_1$. Furthermore, $c^*$, $\bar{c}^*$, $f_0$, and $f_1$ are compatible with the unlabeled data.

So, algorithmically, we first check to see if there exists a hypothesis $g \in C$ with $\widehat{err}_{unl}(g) = 0$ such
that $\hat{d}(f_1, g) \ge 3\tau$ and $\hat{d}(f_0, g) \ge 3\tau$. If such a hypothesis $g$ exists, then it must satisfy either case (1)
or (2) above. Therefore, we know that one of $\{g, \bar{g}\}$ is $2\tau$-close to $c^*$. If not, we must have $p_+ \le 4\tau$ or
$p_- \le 4\tau$, in which case we know that one of $\{f_0, f_1\}$ is $4\tau$-close to $c^*$. So, either way we have a set of two
functions, opposite to each other, one of which is at least $4\tau$-close to $c^*$. We finally use $O(\log_{1/\tau}\frac{1}{\delta})$ labeled
examples to pick one of these to output, namely the one with lowest empirical labeled error. Lemma 2.3.12
below then implies that with probability $1 - \delta$ the function we output has error at most $4\tau \le \epsilon$. ■

Lemma 2.3.12 Consider $\tau < \frac{1}{8}$. Let $C_\tau = \{f, \bar{f}\}$ be a subset of $C$ containing two opposite hypotheses
with the property that one of them is $\tau$-close to $c^*$. Then, $m_l \ge 6\log_{1/\tau}\left(\frac{1}{\delta}\right)$ labeled examples are sufficient
so that with probability $\ge 1 - \delta$, the concept in $C_\tau$ that is $\tau$-close to $c^*$ in fact has lower empirical error.

Proof: We need to show that if $m_l \ge 6\log_{1/\tau}\left(\frac{1}{\delta}\right)$, then

$$S = \sum_{k=0}^{\lfloor m_l/2 \rfloor} \binom{m_l}{k}\tau^{m_l - k}(1 - \tau)^{k} \le \delta.$$

Since $\tau < \frac{1}{8}$ we have:

$$\sum_{k=0}^{\lfloor m_l/2 \rfloor} \binom{m_l}{k}\tau^{m_l - k}(1 - \tau)^{k} \le \sum_{k=0}^{\lfloor m_l/2 \rfloor} \binom{m_l}{k}\tau^{m_l/2} \le 2^{m_l}\tau^{m_l/2}$$

and so $S \le (\sqrt{\tau} \cdot 2)^{m_l}$. For $\tau < \frac{1}{8}$ and $m_l \ge 6\frac{\log(1/\delta)}{\log(1/\tau)} = 6\log_{1/\tau}\left(\frac{1}{\delta}\right)$ it is easy to see that $(\sqrt{\tau} \cdot 2)^{m_l} \le \delta$,
which implies the desired result. ■

In particular, by reducing $\tau$ to $poly(\delta)$ in Theorem 2.3.11, we can reduce the number of labeled
examples needed $m_l$ to one. Note however that we will need polynomially more unlabeled examples.

In fact, the result in Theorem 2.3.11 can be extended to the case that $D^+$ and $D^-$ merely satisfy
constant expansion rather than full independence given the label; see [28].

Note: Theorem 2.3.11 illustrates that if data is especially well behaved with respect to the compatibility
notion, then our bounds on labeled data can be extremely good. In Section 2.4.2, we show that for the case of
linear separators and independence given the label, the bounds of Theorem 2.3.11 on the number of labeled
examples can be achieved by a polynomial-time algorithm. Note, however, that
both these bounds rely heavily on the assumption that the target is fully compatible. If the assumption is
more of a “hope” than a belief, then one would need an additional sample of $1/\epsilon$ labeled examples just to
validate the hypothesis produced.
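
The bound in Lemma 2.3.12 can be sanity-checked numerically. The sketch below assumes the failure event is that the $\tau$-close hypothesis errs on at least half of the $m_l$ labeled examples, with each error occurring independently with probability exactly $\tau$; the values $\tau = 0.1$ and $\delta = 0.05$ and the function names are illustrative assumptions.

```python
import math

def labeled_examples_needed(tau, delta):
    """Smallest integer m with m >= 6 * log_{1/tau}(1/delta)."""
    return math.ceil(6 * math.log(1 / delta) / math.log(1 / tau))

def wrong_pick_probability(tau, m):
    """sum_{k=0}^{floor(m/2)} C(m, k) tau^(m-k) (1-tau)^k, where k counts correct predictions."""
    return sum(math.comb(m, k) * tau ** (m - k) * (1 - tau) ** k
               for k in range(m // 2 + 1))

tau, delta = 0.1, 0.05            # tau < 1/8 as the lemma requires
m = labeled_examples_needed(tau, delta)
exact = wrong_pick_probability(tau, m)
bound = (2 * math.sqrt(tau)) ** m
print(m, exact, bound)
```

For these values the exact sum is well below the $(2\sqrt{\tau})^{m_l}$ bound, which in turn is below $\delta$.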


2.4  Algorithmic  Results 


In  this  section  we  give  several  examples  of  efficient  algorithms  in  our  model  that  are  able  to  learn  using 
sample  sizes  comparable  to  those  described  in  Section  2.3.  Note  that  our  focus  is  on  achieving  a  low-error 
hypothesis  (also  called  minimizing  0-1  loss).  Another  common  practice  in  machine  learning  (both  in  the 
context  of  supervised  and  semi-supervised  learning)  is  to  instead  try  to  minimize  a  surrogate  convex  loss 
that is easier to optimize [82]. While this does simplify the computational problem, it does not in general
solve  the  true  goal  of  achieving  low  error. 

2.4.1  A  simple  case 

We give here a simple example to illustrate the bounds in Section 2.3.1, and for which we can give a
polynomial-time algorithm that takes advantage of them. Let the instance space $X = \{0, 1\}^d$, and for
$x \in X$, let $vars(x)$ be the set of variables set to 1 in the feature vector $x$. Let $C$ be the class of monotone
disjunctions (e.g., $x_1 \vee x_3 \vee x_6$), and for $f \in C$, let $vars(f)$ be the set of variables disjoined by $f$.
Now, suppose we say an example $x$ is compatible with function $f$ if either $vars(x) \subseteq vars(f)$ or else
$vars(x) \cap vars(f) = \emptyset$. This is a very strong notion of “margin”: it says, in essence, that every variable
is either a positive indicator or a negative indicator, and no example should contain both positive and
negative indicators.

Given this setup, we can give a simple $PAC_{unl}$-learning algorithm for this pair $(C, \chi)$: that is, an
algorithm with sample size bounds that are polynomial in (or in this case, matching) those in Theorem 2.3.1.
Specifically, we can prove the following:

Theorem 2.4.1 The class $C$ of monotone disjunctions is $PAC_{unl}$-learnable under the compatibility notion
defined above.

Proof: We begin by using our unlabeled data to construct a graph on $d$ vertices (one per variable),
putting an edge between two vertices $i$ and $j$ if there is any example $x$ in our unlabeled sample with
$i, j \in vars(x)$. We now use our labeled data to label the components. If the target function is fully
compatible, then no component will get multiple labels (if some component does get multiple labels, we
halt with failure). Finally, we produce the hypothesis $f$ such that $vars(f)$ is the union of the
positively-labeled components. This is fully compatible with the unlabeled data and has zero error on the labeled
data, so by Theorem 2.3.1, if the sizes of the data sets are as given in the bounds, with high probability the
hypothesis produced will have error at most $\epsilon$. ■
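
The algorithm in the proof can be sketched with a union-find structure over the $d$ variables. This is a minimal illustration, not the thesis's own code: examples are represented directly by their sets $vars(x)$ (0-indexed), labels are 0 or 1, and the function name is hypothetical.

```python
def learn_monotone_disjunction(d, unlabeled, labeled):
    """Return vars(f) for the hypothesis built in the proof of Theorem 2.4.1.

    unlabeled: iterable of sets of variable indices (vars(x) of each unlabeled example)
    labeled:   iterable of (vars(x), label) pairs with label in {0, 1}
    Raises ValueError if some component receives both labels (target not fully compatible).
    """
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Variables that co-occur in an unlabeled example belong to the same component.
    for x in unlabeled:
        xs = sorted(x)
        for v in xs[1:]:
            root_a, root_b = find(xs[0]), find(v)
            if root_a != root_b:
                parent[root_a] = root_b

    # Use the labeled examples to label the components they touch.
    component_label = {}
    for x, y in labeled:
        for v in x:
            if component_label.setdefault(find(v), y) != y:
                raise ValueError("a component received both labels")

    # vars(f) is the union of the positively labeled components.
    return {v for v in range(d) if component_label.get(find(v)) == 1}

result = learn_monotone_disjunction(
    6,
    unlabeled=[{0, 1}, {1, 2}, {3, 4}],
    labeled=[({0}, 1), ({3}, 0)],
)
```

In the usage above, variables 0, 1, 2 form one component labeled positive, variables 3, 4 form a negative component, and variable 5 is never labeled, so it is left out of the disjunction.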

Notice that if we want to view the algorithm as “purchasing” labeled data, then we can simply
examine the graph, count the number of connected components $k$, and then request $\frac{1}{\epsilon}[k \ln 2 + \ln\frac{2}{\delta}]$ labeled
examples. (Here, $2^k = |C_{S,\chi}(0)|$.) By the proof of Theorem 2.3.1, with high probability $2^k \le |C_{D,\chi}(\epsilon)|$,
so we are purchasing no more than the number of labeled examples in the theorem statement.
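
As a small worked instance of the label budget $\frac{1}{\epsilon}[k \ln 2 + \ln\frac{2}{\delta}]$, the sketch below rounds the expression up to an integer; the function name and the example values of $k$, $\epsilon$, and $\delta$ are assumptions for illustration.

```python
import math

def labels_to_request(k, eps, delta):
    """Number of labeled examples (1/eps) * (k ln 2 + ln(2/delta)), rounded up."""
    return math.ceil((k * math.log(2) + math.log(2 / delta)) / eps)

# Example: 3 connected components, target error 0.1, confidence parameter 0.05.
budget = labels_to_request(3, 0.1, 0.05)
print(budget)
```

The budget grows linearly in the number of components $k$, so a distribution that yields few components directly translates into few purchased labels.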

Also,  it  is  interesting  to  see  the  difference  between  a  “helpful”  and  “non-helpful”  distribution  for  this 
problem.  An  especially  non-helpful  distribution  would  be  the  uniform  distribution  over  all  examples  x 
with $|vars(x)| = 1$, in which there are $d$ components. In this case, unlabeled data does not help at all, and
one still needs $\Omega(d)$ labeled examples (or, even $\Omega(\frac{d}{\epsilon})$ if the distribution is non-uniform as in the lower
bounds of [105]). On the other hand, a helpful distribution is one such that with high probability the
number  of  components  is  small,  such  as  the  case  of  features  appearing  independently  given  the  label. 

2.4.2  Co-training  with  linear  separators 

We now consider the case of co-training where the hypothesis class $C$ is the class of linear separators. For
simplicity we focus first on the case of Example 4: the target function is a linear separator in $R^d$ and each
example is a pair of points, both of which are assumed to be on the same side of the separator (i.e., an
example is a line-segment that does not cross the target hyperplane). We then show how our results can
be extended to the more general setting.

As in the previous example, a natural approach is to try to solve the “consistency” problem: given a set
of labeled and unlabeled data, our goal is to find a separator that is consistent with the labeled examples
and compatible with the unlabeled ones (i.e., it gets the labeled data correct and doesn’t cut too many
edges). Unfortunately, this consistency problem is NP-hard: given a graph $g$ embedded in $R^d$ with two
distinguished points $s$ and $t$, it is NP-hard to find the linear separator with $s$ on one side and $t$ on the
other that cuts the minimum number of edges, even if the minimum is zero [108]. For this reason, we will
make an additional assumption, that the two points in an example are each drawn independently given
the label. That is, there is a single distribution $D$ over $R^d$, and with some probability $p_+$, two points
are drawn i.i.d. from $D^+$ ($D$ restricted to the positive side of the target function) and with probability
$1 - p_+$, the two are drawn i.i.d. from $D^-$ ($D$ restricted to the negative side of the target function). Note
that our sample complexity results in Section 2.3.2 extend to weaker assumptions such as distributional
expansion introduced by [28], but we need true independence for our algorithmic results. [62] also give
positive algorithmic results for co-training when (a) the two views of an example are drawn independently
given the label (which we are assuming now), (b) the underlying function is learnable via Statistical
Query algorithms^ (which is true for linear separators [64]), and (c) we have enough labeled data to
produce a weakly-useful hypothesis (defined below) on one of the views to begin with. We give here an
improvement over that result by showing how we can run the algorithm in [62] with only a single labeled
example, thus obtaining an efficient algorithm in our model. It is worth noticing that in the process, we
also somewhat simplify the results of [64] on efficiently learning linear separators with noise without a
margin assumption.

For the analysis below, we need the following definition. A weakly-useful predictor is a function $f$
such that for some $\beta$ that is at least inverse polynomial in the input size we have:

$$\Pr[f(x) = 1 \mid c^*(x) = 1] > \Pr[f(x) = 1 \mid c^*(x) = 0] + \beta. \qquad (2.2)$$

It is equivalent to the usual notion of a “weak hypothesis” [149] when the target function is balanced,
but requires that the hypothesis give more information when the target function is unbalanced [62]. Also,
we will assume for convenience that the target separator passes through the origin, and let us denote the
separator by $c^* \cdot x = 0$.

We now describe an efficient algorithm to learn to any desired error rate $\epsilon$ in this setting from just
a single labeled example. For clarity, we first describe an algorithm whose running time depends
polynomially on both the dimension $d$ and $1/\gamma$, where $\gamma$ is a soft margin of separation between positive and
negative examples. Formally, in this case we assume that at least some non-negligible probability mass of
examples $x$ satisfy $\frac{|x \cdot c^*|}{|x||c^*|} \ge \gamma$; i.e., they have distance at least $\gamma$ to the separating hyperplane $x \cdot c^* = 0$
after normalization. This is a common type of assumption in machine learning (in fact, often one makes
the much stronger assumption that nearly all probability mass is on examples $x$ satisfying this condition).
We then show how one can replace the dependence on $1/\gamma$ with instead a polynomial dependence on the
number of bits of precision $b$ in the data, using the Outlier Removal Lemma of [64] and [104].

Theorem 2.4.2 Assume that at least an $\alpha$ probability mass of examples $x$ have margin $\frac{|x \cdot c^*|}{|x||c^*|} \ge \gamma$ with
respect to the target separator $c^*$. There is a polynomial-time algorithm (polynomial in $d$, $1/\gamma$, $1/\alpha$, $1/\epsilon$,
and $1/\delta$) to learn a linear separator under the above assumptions, from a polynomial number of unlabeled
examples and a single labeled example.

^For a detailed description of the Statistical Query model see [148] and [149].

Algorithm 1 Co-training with Linear Separators. The Soft Margin Case.

Input: $\epsilon$, $\delta$, $T$, a set $S_L$ of $m_l$ labeled examples drawn i.i.d. from $D$, a set $S_U$ of $m_u$ unlabeled
examples drawn i.i.d. from $D$.

Output: Hypothesis of low error.

Let $h_p$ be the all-positive function. Let $h_n$ be the all-negative function. Let $\tau = \epsilon/6$, $\epsilon_1 = \tau/4$.

(1) For $i = 1, \ldots, T$ do

- Choose a random halfspace $f_i$ going through the origin.

- Feed $f_i$, $S_U$, error parameter $\epsilon_1$ and confidence parameter $\delta/6$ into the bootstrapping
procedure of [62] to produce $h_i$.

(2) Let $h$ be $\arg\min_{h_i} \{ \widehat{err}_{unl}(h_i) \mid \hat{d}(h_i, h_p) \geq 3\tau,\ \hat{d}(h_i, h_n) \geq 3\tau \}$.

If $\widehat{err}_{unl}(h) > 3\epsilon_1$, then let $h = h_p$.

(3) Use $S_L$ to output either $h$ or $\bar{h}$ (the negation of $h$): output the hypothesis with lowest empirical error on the set $S_L$.
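
Steps (2) and (3) of Algorithm 1 can be sketched in code as follows. This is only an illustrative sketch: the bootstrapping procedure of [62] is treated as a black box that has already produced the candidate list, hypotheses are assumed to be Python callables returning +1 or -1, the empirical distances are assumed to be measured over all points appearing in the unlabeled pairs, and the fallback to the all-positive function when no candidate passes the distance filter is an assumption not spelled out in the algorithm. All function names are hypothetical.

```python
def all_positive(x):
    return 1

def all_negative(x):
    return -1

def unlabeled_error(h, pairs):
    """Empirical unlabeled error: fraction of pairs (x1, x2) with h(x1) != h(x2)."""
    return sum(h(x1) != h(x2) for x1, x2 in pairs) / len(pairs)

def distance(h, g, points):
    """Empirical distance: fraction of points on which h and g differ."""
    return sum(h(x) != g(x) for x in points) / len(points)

def labeled_error(h, labeled):
    return sum(h(x) != y for x, y in labeled) / len(labeled)

def select_hypothesis(candidates, pairs, labeled, tau, eps1):
    points = [x for pair in pairs for x in pair]
    # Step (2): smallest empirical unlabeled error among candidates that are
    # empirically at least 3*tau-far from both constant functions.
    far = [h for h in candidates
           if distance(h, all_positive, points) >= 3 * tau
           and distance(h, all_negative, points) >= 3 * tau]
    # Assumption: fall back to all-positive when no candidate qualifies.
    h = min(far, key=lambda g: unlabeled_error(g, pairs)) if far else all_positive
    if unlabeled_error(h, pairs) > 3 * eps1:
        h = all_positive

    # Step (3): pick h or its negation using the labeled examples.
    def negated(x):
        return -h(x)

    return min([h, negated], key=lambda g: labeled_error(g, labeled))
```

For instance, with points on the real line, a candidate that separates the two classes but with the signs flipped is selected in Step (2), and a single labeled example is enough for Step (3) to output its negation.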


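Proof: Let $\epsilon$ and $\delta$ be the desired accuracy and confidence parameters. Let $T = O\left(\frac{1}{\alpha\gamma} \log\frac{1}{\delta}\right)$, $m_u =
poly(1/\gamma, 1/\alpha, 1/\epsilon, 1/\delta, d)$, and $m_l = 1$. We run Algorithm 1 with the inputs $\epsilon$, $\delta$, $T$, $S_L$, $S_U$, and $m_l = 1$.
Let $\tau = \epsilon/6$, $\epsilon_1 = \tau/4$.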

In  order  to  prove  the  desired  result,  we  start  with  a  few  facts. 

We first note that our bound on $m_u$ is sufficient to ensure that with probability $\geq 1 - \delta/3$ we have (a)
$|\hat{d}(f, g) - d(f, g)| \leq \tau$ for all $f, g \in C$ and (b) all $f \in C$ have $|\widehat{err}_{unl}(f) - err_{unl}(f)| \leq \epsilon_1$.

We now argue that if at least an $\alpha$ probability mass of examples $x$ have margin $\frac{|x \cdot c^*|}{|x||c^*|} \geq \gamma$ with
respect to the target separator $c^*$, then a random halfspace has at least a $poly(\alpha, \gamma)$ probability of being
a weakly-useful predictor. (Note that [64] uses the Perceptron algorithm to get weak learning; here, we
need something simpler since we need to save our labeled example to the very end.) Specifically, consider
a point $x$ of margin $\gamma_x \geq \gamma$. By definition, the margin is the cosine of the angle between $x$ and $c^*$, and
therefore the angle between $x$ and $c^*$ is $\cos^{-1}(\gamma_x) = \pi/2 - \sin^{-1}(\gamma_x) \leq \pi/2 - \gamma$. Now, imagine that we draw $f$ at
random subject to $f \cdot c^* \geq 0$ (half of the $f$'s will have this property) and define $f(x) = sign(f \cdot x)$. Then,

$$\Pr_f(f(x) \neq c^*(x) \mid f \cdot c^* \geq 0) \leq (\pi/2 - \gamma)/\pi = 1/2 - \gamma/\pi.$$

Moreover, if $x$ does not have margin $\gamma$ then at the very least we have $\Pr_f(f(x) \neq c^*(x) \mid f \cdot c^* \geq 0) \leq 1/2$.
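
The probability bound above can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original argument: it assumes the two-dimensional case (which suffices, since only the plane spanned by $x$ and $c^*$ matters), fixes $c^* = (1, 0)$, and uses a hypothetical helper name. A random halfspace conditioned on $f \cdot c^* \geq 0$ mislabels a positive point $x$ with probability equal to the angle between $x$ and $c^*$ divided by $\pi$.

```python
import math
import random

def disagreement_rate(gamma_x, trials, rng):
    """Estimate Pr_f(sign(f.x) != sign(c*.x) | f.c* >= 0) for a point of margin gamma_x."""
    c_star = (1.0, 0.0)
    x = (gamma_x, math.sqrt(1.0 - gamma_x ** 2))  # unit vector with x . c* = gamma_x
    kept = wrong = 0
    while kept < trials:
        phi = rng.uniform(0.0, 2.0 * math.pi)
        f = (math.cos(phi), math.sin(phi))        # uniformly random direction
        if f[0] * c_star[0] + f[1] * c_star[1] < 0:
            continue                              # condition on f . c* >= 0
        kept += 1
        if f[0] * x[0] + f[1] * x[1] < 0:         # f calls x negative, c* calls it positive
            wrong += 1
    return wrong / trials

gamma_x = 0.3
estimate = disagreement_rate(gamma_x, 20000, random.Random(0))
exact = math.acos(gamma_x) / math.pi   # (angle between x and c*) / pi
bound = 0.5 - gamma_x / math.pi        # the bound 1/2 - gamma/pi
```

The estimate should land close to `exact`, which in turn lies below `bound`.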

Now define distribution $D^* = \frac{1}{2} D_+ + \frac{1}{2} D_-$; that is, $D^*$ is the distribution $D$ but balanced to 50%
positive and 50% negative. With respect to $D^*$ at least an $\alpha/2$ probability mass of the examples have
margin at least $\gamma$, and therefore:

$$E_f[err_{D^*}(f) \mid f \cdot c^* \geq 0] \leq 1/2 - (\alpha/2)(\gamma/\pi).$$

Since $err_{D^*}(f)$ is a bounded quantity, by Markov's inequality this means that at least an $\Omega(\alpha\gamma)$ probability
mass of functions $f$ must satisfy $err_{D^*}(f) \leq \frac{1}{2} - \frac{\alpha\gamma}{4\pi}$, which in turn implies that they must be weakly-useful
predictors with respect to $D$ as defined in Equation (2.2), with advantage parameter $\Omega(\alpha\gamma)$.
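
One way to fill in the Markov step is sketched below. The shorthand $X$ and $a$ is introduced here for illustration only and does not appear in the original text.

```latex
% Sketch only. X = err_{D^*}(f) for f drawn at random subject to f \cdot c^* \ge 0,
% and a = \alpha\gamma/(2\pi), so the bound on the expectation reads E[X] \le 1/2 - a.
\begin{align*}
\Pr\left[X > \tfrac{1}{2} - \tfrac{a}{2}\right]
  &\le \frac{E[X]}{\tfrac{1}{2} - \tfrac{a}{2}}
   \le \frac{\tfrac{1}{2} - a}{\tfrac{1}{2} - \tfrac{a}{2}}
   = 1 - \frac{a}{1 - a}, \\
\Pr\left[X \le \tfrac{1}{2} - \tfrac{a}{2}\right]
  &\ge \frac{a}{1 - a} \ge a = \frac{\alpha\gamma}{2\pi}.
\end{align*}
% Here 1/2 - a/2 = 1/2 - \alpha\gamma/(4\pi). Since half of all f satisfy
% f \cdot c^* \ge 0, at least an \alpha\gamma/(4\pi) fraction of all random
% halfspaces reach this error level on D^*.
```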

The second part of the argument is as follows. Note that in Step (1) of our algorithm we repeat the
following process for $T$ iterations: pick a random $f_i$, and plug it into the bootstrapping theorem of [62]
(which, given a distribution over unlabeled pairs $(x_1^j, x_2^j)$, will use $f_i(x_1^j)$ as a noisy label of $x_2^j$, feeding the
result into a Statistical Query algorithm). Since $T = O\left(\frac{1}{\alpha\gamma} \log\frac{1}{\delta}\right)$, using the above observation about
random halfspaces being weak predictors, we obtain that with high probability at least $1 - \delta/6$, at least one
of the random hypotheses $f_i$ was a weakly-useful predictor; and since $m_u = poly(1/\gamma, 1/\alpha, 1/\epsilon, 1/\delta, d)$
we also have that the associated hypothesis $h_i$ output by the bootstrapping procedure of [62] will with
probability at least $1 - \delta/6$ satisfy $err(h_i) \leq \epsilon_1$. This implies that with high probability at least $1 - 2\delta/3$,
at least one of the hypotheses $h_i$ we find in Step 1 has true labeled error at most $\epsilon_1$. For the rest of the
hypotheses we find in Step 1, we have no guarantees.

We now observe the following. First of all, any function $f$ with small $err(f)$ must have small
$err_{unl}(f)$; in particular,

$$err_{unl}(f) = \Pr(f(x_1) \neq f(x_2)) \leq 2\,err(f).$$

This implies that with high probability at least $1 - 2\delta/3$, at least one of the hypotheses $h_i$ we find in Step
1 has true unlabeled error at most $2\epsilon_1$, and therefore empirical unlabeled error at most $3\epsilon_1$. Secondly,
because of the assumption of independence given the label, as shown in Theorem 2.3.1, with high
probability the only functions with unlabeled error at most $\tau$ are functions $2\tau$-close to $c^*$, $2\tau$-close to $\neg c^*$,
$2\tau$-close to the “all positive” function, or $2\tau$-close to the “all negative” function.

In Step (2) we first examine all the hypotheses produced in Step 1, and we pick the hypothesis $h$
with the smallest empirical unlabeled error rate subject to being empirically at least $3\tau$-far from the
“all-positive” or “all-negative” functions. If the empirical error rate of this hypothesis $h$ is at most $3\epsilon_1$ we
know that its true unlabeled error rate is at most $4\epsilon_1 \leq \tau$, which further implies that either $h$ or $\neg h$ is
$2\tau$-close to $c^*$. However, if the empirical unlabeled error rate of $h$ is greater than $3\epsilon_1$, then we know that the
target must be $4\tau$-close to the all-positive or all-negative function so we simply choose $h$ = “all positive”
(this is true since the unlabeled sample was large enough so that $|\hat{d}(f, g) - d(f, g)| \leq \tau$).

So, we have argued that with probability at least $1 - 2\delta/3$ either $h$ or $\neg h$ is $4\tau$-close to $c^*$. We can
now just use $O\left(\log_{1/\tau}\frac{1}{\delta}\right)$ labeled examples to determine which case is which (Lemma 2.3.12). This
quantity is at most 1 and our error rate is at most $\epsilon$ if we set $\tau \leq \epsilon/4$ and $\tau$ sufficiently small compared to
$\delta$. This completes the proof. ■

The above algorithm assumes one can efficiently pick a random unit-length vector in $\mathbb{R}^d$, but the
argument easily goes through even if we do this to only $O(\log 1/\gamma)$ bits of precision.
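
A standard way to pick such a vector, shown here only as a sketch, is to draw independent Gaussian coordinates and normalize; rounding each coordinate to a fixed number of bits then gives the limited-precision version. The function names and the choice of 8 bits below are illustrative assumptions, not taken from the text.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniformly random direction in R^d via normalized Gaussian coordinates."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def quantize(v, bits):
    """Round every coordinate to the nearest multiple of 2**-bits."""
    scale = 2 ** bits
    return [round(c * scale) / scale for c in v]

rng = random.Random(1)
f = random_unit_vector(5, rng)
f_coarse = quantize(f, bits=8)   # each coordinate moves by at most 2**-9
```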

We now extend the result to the case that we make no margin assumption.

Theorem 2.4.3 There is a polynomial-time algorithm (in $d$, $b$, $1/\epsilon$, and $1/\delta$, where $d$ is the dimension of
the space and $b$ is the number of bits per example) to learn a linear separator under the above assumptions,
from a polynomial number of unlabeled examples and a single labeled example. Thus, we efficiently
$PAC_{unl}$-learn the class of linear separators over $\{-2^b, \ldots, 2^b - 1, 2^b\}^d$ under the agreement notion of
compatibility if the distribution $D$ satisfies independence given the label.

Proof: We begin by drawing a large unlabeled sample $S$ (of size polynomial in $d$ and $b$). We then
compute a linear transformation $T$ that when applied to $S$ has the property that for any hyperplane
$w \cdot x = 0$, at least a $1/poly(d, b)$ fraction of $T(S)$ has margin at least $1/poly(d, b)$. We can do this via the
Outlier Removal Lemma of [64] and [104]. Specifically, the Outlier Removal Lemma states that given
a set of points $S$, one can algorithmically remove an $\epsilon'$ fraction of $S$ and ensure that for the remaining
set $S'$, for any vector $w$, $\max_{x \in S'} (w \cdot x)^2 \leq poly(d, b, 1/\epsilon') \cdot E_{x \in S'}[(w \cdot x)^2]$, where $b$ is the number
of bits needed to describe the input points. Given such a set $S'$, one can then use its eigenvectors to
compute a standard linear transformation (also described in [64]) $T : \mathbb{R}^d \to \mathbb{R}^{d'}$, where $d' \leq d$ is the
dimension of the subspace spanned by $S'$, such that in the transformed space, for all unit-length $w$, we
have $E_{x \in T(S')}[(w \cdot x)^2] = 1$. In particular, since the maximum of $(w \cdot x)^2$ is bounded, this implies that
for any vector $w \in \mathbb{R}^{d'}$, at least an $\alpha$ fraction of points $x \in T(S')$ have margin at least $\alpha$ for some
$\alpha \geq 1/poly(b, d, 1/\epsilon')$.
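
The two facts used above can be spelled out as follows. This is a sketch under stated assumptions rather than the construction of [64] itself: it takes $T$ to be the usual whitening map built from the eigendecomposition of the second-moment matrix of $S'$, and it writes $B$ for the $poly(d, b, 1/\epsilon')$ factor of the Outlier Removal Lemma. The symbols $M$, $U$, $\Lambda$, $B$ and $p$ are introduced here for illustration.

```latex
% Assumed form of T: M = E_{x \in S'}[x x^\top] = U \Lambda U^\top on span(S'),
% with U a d x d' matrix with orthonormal columns and \Lambda positive diagonal.
\begin{align*}
T(x) &= \Lambda^{-1/2} U^\top x, \\
E_{x \in S'}\big[(w \cdot T(x))^2\big]
  &= w^\top \Lambda^{-1/2} U^\top M U \Lambda^{-1/2} w = w^\top w = 1
  \quad \text{for unit-length } w \in \mathbb{R}^{d'}.
\end{align*}
% Margin consequence: let p be the fraction of x \in T(S') with (w \cdot x)^2 \ge 1/2,
% and suppose (w \cdot x)^2 \le B for every unit w and every x \in T(S').
\begin{align*}
1 = E\big[(w \cdot x)^2\big] \le pB + \tfrac{1}{2}
  &\;\Longrightarrow\; p \ge \frac{1}{2B}, \\
|x|^2 = \sum_{i=1}^{d'} (e_i \cdot x)^2 \le d'B
  &\;\Longrightarrow\; \frac{|w \cdot x|}{|x|} \ge \frac{1}{\sqrt{2 d' B}}
  \text{ on that fraction.}
\end{align*}
```

Both $1/(2B)$ and $1/\sqrt{2 d' B}$ are inverse-polynomial in $b$, $d$, and $1/\epsilon'$, which matches the form of the claimed $\alpha$.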




Now, choose $\epsilon' = \epsilon/4$, and let $D'$ be the distribution $D$ restricted to the space spanned by $S'$. By
VC-dimension bounds, $|S'| = O(d/\alpha)$ is sufficient so that with high probability, (a) $D'$ has probability
mass at least $1 - \epsilon/2$, and (b) the vector $T(c^*)$ has at least an $\alpha/2$ probability mass of $T(D')$ at margin
$\geq \alpha$. Thus, the linear transformation $T$ converts the distribution $D'$ into one satisfying the conditions
needed for Theorem 2.4.2, and any hypothesis produced with error $\leq \epsilon/2$ on $D'$ will have error at most $\epsilon$
on $D$. So, we simply apply $T$ to $D'$ and run the algorithm for Theorem 2.4.2 to produce a low-error linear
separator. ■

Note: We can easily extend our algorithm to the standard co-training setting (where $c_1^*$ can be different
from $c_2^*$) as follows: we repeat the procedure in a symmetric fashion, and then just try all combinations
of pairs of functions returned to find one of small unlabeled error rate, not close to “all positive” or “all
negative”. Finally we use $O\left(\log_{1/\tau}\frac{1}{\delta}\right)$ labeled examples to produce a low error hypothesis (and here
we use only one part of the example and only one of the functions in the pair).


2.5  Related  Models 

In this section we discuss a transductive analog of our model, some connections with generative models
and other ways of using unlabeled data in Machine Learning, and the relationship between our model and
the luckiness framework of [191].

2.5.1  A  Transductive  Analog  of  our  Model 

In transductive learning, one is given a fixed set $S$ of examples, of which some small random subset is
labeled, and the goal is to predict well on the rest of $S$. That is, we know which examples we will be tested
on up front, and in a sense this is a case of learning from a known distribution (the uniform distribution over
$S$). We can also talk about a transductive analog of our inductive model, that incorporates many of the
transductive learning methods that have been developed. In order to make use of unlabeled examples, we
will again express the relationship we hope the target function has with the data through a compatibility
notion $\chi$. However, since in this case the compatibility of a given hypothesis is completely determined
by $S$ (which is known), we will not need to require that compatibility be an expectation over unlabeled
examples. From the sample complexity point of view we only care about how much labeled data we need,
and algorithmically we need to find a highly compatible hypothesis with low error on the labeled data.

Rather than presenting general theorems, we instead focus on the modeling question, and show how
a number of existing transductive graph-based learning algorithms can be modeled in our framework. In
these methods one usually assumes that there is a weighted graph $g$ defined over $S$, which is given a priori
and encodes the prior knowledge. In the following we denote by $W$ the weighted adjacency matrix of $g$
and by $C_S$ the set of all binary functions over $S$.

Minimum cut Suppose for $f \in C_S$ we define the incompatibility of $f$ to be the weight of the cut in $g$
determined by $f$. This is the implicit notion of compatibility considered in [59], and algorithmically
the goal is to find the most compatible hypothesis that is correct on the labeled data, which can be
solved efficiently using network flow. From a sample-complexity point of view, the number of
labeled examples we need is proportional to the VC-dimension of the class of hypotheses that are
at least as compatible as the target function. This is known to be $O(k/\lambda)$ [150, 152], where $k$ is the
number of edges cut by $c^*$ and $\lambda$ is the size of the global minimum cut in the graph. Also note that
the Randomized Mincut algorithm (considered by [67]), which is an extension of the basic mincut
approach, can be viewed as motivated by a PAC-Bayes sample complexity analysis of the problem.
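
The network-flow formulation can be illustrated with a small, self-contained sketch. It is not the implementation of [59]: the input format, the function names, and the use of the Edmonds-Karp augmenting-path method are choices made here for illustration. Labeled positives are tied to an artificial source and labeled negatives to an artificial sink with infinite capacity, so that a minimum source-sink cut is the most compatible labeling that agrees with the labeled data.

```python
from collections import deque

def mincut_labels(weights, positives, negatives):
    """Label every node +1 or -1 by a minimum cut separating the labeled sets.

    weights: {(i, j): w} for undirected edges (hypothetical input format).
    """
    s, t = "source", "sink"
    cap = {s: {}, t: {}}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v][u] = cap[v].get(u, 0)

    for (i, j), w in weights.items():
        add_edge(i, j, w)
        add_edge(j, i, w)
    for p in positives:
        add_edge(s, p, float("inf"))   # labeled positives stay on the source side
    for n in negatives:
        add_edge(n, t, float("inf"))   # labeled negatives stay on the sink side

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:                        # augment along shortest residual paths
        parent = reachable_from_source()
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    nodes = [v for v in cap if v not in (s, t)]
    return {v: (1 if v in parent else -1) for v in nodes}

def cut_weight(weights, labels):
    """Incompatibility of a labeling: total weight of edges it cuts."""
    return sum(w for (i, j), w in weights.items() if labels[i] != labels[j])
```

On the path graph with edge weights 3, 1, 3 and the two endpoints labeled positive and negative, the cut falls on the middle edge of weight 1.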




Normalized Cut For $f \in C_S$ define $size(f)$ to be the weight of the cut in $g$ determined by $f$, and let
$neg(f)$ and $pos(f)$ be the number of points in $S$ on which $f$ predicts negative and positive,
respectively. For the normalized cut setting of [140] we can define the incompatibility of $f \in C_S$
to be $\frac{size(f)}{neg(f) \cdot pos(f)}$. This is the penalty function used in [140], and again, algorithmically the goal
would be to find a highly compatible hypothesis that is correct on the labeled data. Unfortunately,
the corresponding optimization problem is in this case NP-hard. Still, several approximate
solutions have been considered, leading to different semi-supervised learning algorithms. For instance,
Joachims [140] considers a spectral relaxation that leads to the “SGT algorithm”; another relaxation
based on semidefinite programming is considered in [56].

Harmonic Function We can also model the algorithms introduced in [215] as follows. If we consider $f$
to be a probabilistic prediction function defined over $S$, then we can define the incompatibility of $f$
to be

$$\sum_{i,j} w_{i,j} (f(i) - f(j))^2 = f^T L f,$$

where $L$ is the un-normalized Laplacian of $g$. Similarly we can model the algorithm introduced
by Zhao et al. [213] by using an incompatibility of $f$ given by $f^T \mathcal{L} f$ where $\mathcal{L}$ is the normalized
Laplacian of $g$. More generally, all the Graph Kernel methods can be viewed in our framework if
we consider that the incompatibility of $f$ is given by $\|f\|_K = f^T K f$ where $K$ is a kernel derived
from the graph (see for instance [216]).
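
The Laplacian form of the incompatibility can be checked on a toy graph. The sketch below uses hypothetical function names and assumes a symmetric weight matrix with zero diagonal; it takes the sum over unordered pairs $i < j$, which is the convention under which the sum equals $f^T L f$ exactly (summing over ordered pairs gives twice this value).

```python
def laplacian(W):
    """Un-normalized Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(M, f):
    """Compute f^T M f."""
    n = len(f)
    return sum(f[i] * M[i][j] * f[j] for i in range(n) for j in range(n))

def edge_sum(W, f):
    """Sum of w_ij * (f(i) - f(j))^2 over unordered pairs i < j."""
    n = len(W)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

W = [[0.0, 2.0, 0.0],
     [2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
f = [1.0, 0.5, 0.0]          # soft predictions on the three nodes
incompatibility = quadratic_form(laplacian(W), f)
```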

2.5.2  Connections  to  Generative  Models 

It is also interesting to consider how generative models can be fit into our model. As mentioned in Section
2.1, a typical assumption in a generative setting is that $D$ is a mixture with the probability density function
$p(x|\theta) = p_0 \cdot p_0(x|\theta_0) + p_1 \cdot p_1(x|\theta_1)$ (see for instance [76, 77, 183]). In other words, the labeled
examples are generated according to the following mechanism: a label $y \in \{0, 1\}$ is drawn according to
the distribution of classes $\{p_0, p_1\}$ and then a corresponding random feature vector is drawn according
to the class-conditional density $p_y$. The assumption typically used is that the mixture is identifiable.
Identifiability ensures that the Bayes optimal decision border $\{x : p_0 \cdot p_0(x|\theta_0) = p_1 \cdot p_1(x|\theta_1)\}$ can
be deduced if $p(x|\theta)$ is known, and therefore one can construct an estimate of the Bayes border by using
$p(x|\hat{\theta})$ instead of $p(x|\theta)$. Essentially once the decision border is estimated, a small labeled sample suffices
to learn (with high confidence and small error) the appropriate class labels associated with the two disjoint
regions generated by the estimate of the Bayes decision border. To see how we can incorporate this setting
in our model, consider for illustration the setting in [183]; there they assume that $p_0 = p_1$, and that the
class conditional densities are $d$-dimensional Gaussians with unit covariance and unknown mean vectors
$\theta_i \in \mathbb{R}^d$. The algorithm used is the following: the unknown parameter vector $\theta = (\theta_0, \theta_1)$ is estimated
from unlabeled data using a maximum likelihood estimate; this determines a hypothesis which is a linear
separator that passes through the point $(\hat{\theta}_0 + \hat{\theta}_1)/2$ and is orthogonal to the vector $\hat{\theta}_1 - \hat{\theta}_0$; finally each
of the two decision regions separated by the hyperplane is labeled according to the majority of the labeled
examples in the region. Given this setting, a natural notion of compatibility we can consider is the expected
log-likelihood function (where the expectation is taken with respect to the unknown distribution specified
by $\theta$). Specifically, we can identify a legal hypothesis $f_\theta$ with the set of parameters $\theta = (\theta_0, \theta_1)$ that
determine it, and then we can define $\chi(f_\theta, D) = E_{x \in D}[\log(p(x|\theta))]$. [183] show that if the unlabeled
sample is large enough, then all hypotheses specified by parameters $\theta$ which are close enough to the true
parameters will have the property that their empirical compatibilities will be close enough to their true compatibilities.
This then implies (together with other observations about Gaussian mixtures) that the maximum likelihood
estimate will be close enough to $\theta$, up to permutations. (This actually motivates $\chi$ as a good compatibility
function in our model.)
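
The last two steps of the procedure described above (building the separator from the estimated means, then naming the two regions by majority vote) can be sketched as follows. The maximum likelihood estimation itself is not shown: the estimated means are simply passed in. Function names are hypothetical, and the tie-breaking rule (a region with no labeled examples defaults to label 0) is an assumption made here.

```python
def midpoint_separator(theta0, theta1):
    """Hyperplane through (theta0 + theta1)/2, orthogonal to theta1 - theta0.

    Returns a function telling on which side of the hyperplane a point lies
    (True for the side containing theta1).
    """
    w = [b - a for a, b in zip(theta0, theta1)]
    mid = [(a + b) / 2.0 for a, b in zip(theta0, theta1)]
    offset = sum(wi * mi for wi, mi in zip(w, mid))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - offset >= 0

def label_regions(side, labeled):
    """Give each side of the hyperplane the majority label of the labeled points in it."""
    votes = {True: {0: 0, 1: 0}, False: {0: 0, 1: 0}}
    for x, y in labeled:
        votes[side(x)][y] += 1
    region_label = {s: max(v, key=v.get) for s, v in votes.items()}
    return lambda x: region_label[side(x)]

# Estimated means, standing in for the output of the maximum likelihood step.
side = midpoint_separator([0.0, 0.0], [2.0, 0.0])
labeled = [([0.1, 0.3], 0), ([1.9, -0.2], 1), ([2.2, 0.5], 1)]
classify = label_regions(side, labeled)
```

Because the mixture is only identified up to permutation of the two components, it is the labeled examples, not the estimate, that decide which region is called 0 and which 1.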

More generally, we can deal with other parametric families using the same compatibility notion;
however, we will need to impose constraints on the distributions allowed in order to ensure that the
compatibility is actually well defined (the expected log-likelihood is bounded).

As  mentioned  in  Section  2.1,  this  kind  of  generative  setting  is  really  at  the  extreme  of  our  model. 
The  assumption  that  the  distribution  that  generates  the  data  is  truly  a  mixture  implies  that  if  we  knew  the 
distribution,  then  there  are  only  two  possible  concepts  left  (and  this  makes  the  unlabeled  data  extremely 
useful). 

2.5.3  Connections  to  the  Luckiness  Framework 

It is worth noticing that there is a strong connection between our approach and the luckiness
framework [170, 191]. In both cases, the idea is to define an ordering of hypotheses that depends on the data,
in the hope that we will be “lucky” and find that the target function appears early in the ordering. There
are two main differences, however. The first is that the luckiness framework (because it was designed for
supervised learning only) uses labeled data both for estimating compatibility and for learning: this is a
more difficult task, and as a result our bounds on labeled data can be significantly better. For instance,
in Example 4 described in Section 2.2, for any non-degenerate distribution, a dataset of $d/2$ pairs can with
probability 1 be completely shattered by fully-compatible hypotheses, so the luckiness framework does
not help. In contrast, with a larger (unlabeled) sample, one can potentially reduce the space of compatible
functions quite significantly, and learn from $o(d)$ or even $O(1)$ labeled examples depending on the
distribution (see Section 2.3.2 and Section 2.4). Secondly, the luckiness framework talks about compatibility
between a hypothesis and a sample, whereas we define compatibility with respect to a distribution. This
allows us to talk about the amount of unlabeled data needed to estimate true compatibility. There are also
a number of differences at the technical level of the definitions.

2.5.4 Relationship to Other Ways of Using Unlabeled Data for Learning

It is well known that when learning under an unknown distribution, unlabeled data might help
somewhat even in the standard discriminative models by allowing one to use both distribution-specific
algorithms [53], [144], [194] and/or tighter data dependent sample-complexity bounds [43, 155]. However in
all these methods one chooses a class of functions or a prior over functions before performing the
inference. This does not capture the power of unlabeled data in many of the practical semi-supervised learning
methods, where typically one has some idea about what the structure of the data tells about the target function,
and where the choice of prior can be made more precise after seeing the unlabeled data [62, 141, 158, 184].
Our focus in this chapter has been to provide a unified discriminative framework for reasoning about
usefulness of unlabeled data in such settings in which one can analyze both sample complexity and
algorithmic results.

Another learning setting where unlabeled data is useful and which has been increasingly popular for
the past few years is Active Learning [30, 33, 34, 41, 86, 94]. Here, the learning algorithm has both the
capability of drawing random unlabeled examples from the underlying distribution and that of asking for
the labels of any of these examples, and the hope is that a good classifier can be learned with significantly
fewer labels by actively directing the queries to informative examples. Note though that as opposed
to the Semi-supervised learning setting, and similarly to the classical supervised learning settings (PAC
and Statistical Learning Theory settings) the only prior belief about the learning problem in the Active
Learning setting is that the target function (or a good approximation of it) belongs to a given concept
class. Luckily, it turns out that for simple concept classes such as linear separators on the line one can
achieve an exponential improvement (over the usual supervised learning setting) in the labeled data sample
complexity, under no additional assumptions about the learning problem [30, 86].^ In general, however,
for more complicated concept classes, the speed-ups achievable in the active learning setting depend on
the match between the distribution over example-label pairs and the hypothesis class, and therefore on
the target hypothesis in the class. We discuss all these further as well as our contribution on the topic in
Chapter 5.
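
The exponential improvement for linear separators on the line can be illustrated in the realizable case by binary search over a sorted pool of unlabeled points. This is a sketch with hypothetical names, assuming a noise-free threshold target that labels points at or above the threshold positive; it is not the algorithm of any particular cited paper. With $n$ unlabeled points it requests about $\log_2 n$ labels, whereas passive learning would label on the order of $n$ points to locate the boundary equally well.

```python
def learn_threshold(pool, query_label):
    """Find the leftmost positive point in the pool using few label queries.

    pool: unlabeled points on the real line.
    query_label: oracle returning +1 or -1 for a point (each call costs one label).
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)            # invariant: xs[:lo] are negative, xs[hi:] are positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) > 0:
            hi = mid
        else:
            lo = mid + 1
    threshold = xs[lo] if lo < len(xs) else float("inf")
    return threshold, queries

pool = [i / 1000 for i in range(1000)]
oracle = lambda x: 1 if x >= 0.3371 else -1   # hidden target threshold
threshold, queries = learn_threshold(pool, oracle)
```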

Finally, in this thesis, we present in the context of learning with kernels and more general similarity
functions one other interesting use of unlabeled data in the learning process. While the approach of
using unlabeled data in that context does have a similar flavor to the approach in this chapter, the final
guarantees and learning procedures are somewhat different from those presented here. In that case the
hypothesis space has an infinite capacity before performing the inference. In the training process, in a
first stage, we use unlabeled data in order to extract a much smaller set of functions with the property that
with high probability the target is well approximated by one of the functions in the smaller class. In a second
stage we then use labeled examples to learn well. We present this in more detail in Chapter 3, Section 3.5.


2.6  Conclusions 


Given the easy availability of unlabeled data in many settings, there has been growing interest in
methods that try to use such data together with the (more expensive) labeled data for learning. Nonetheless,
there has been substantial disagreement and no clear consensus about when unlabeled data helps and by
how much. In our work, we have provided a PAC-style model for semi-supervised learning that captures
many of the ways unlabeled data is typically used, and provides a very general framework for thinking
about this issue. The high level implication of our analysis is that unlabeled data is useful if (a) we have
a good notion of compatibility so that the target function indeed has a low unlabeled error rate, (b) the
distribution D is helpful in the sense that not too many other hypotheses also have a low unlabeled error
rate, and (c) we have enough unlabeled data to estimate unlabeled error rates well. We then make these
statements precise through a series of sample-complexity results, giving bounds as well as identifying the
key quantities of interest. In addition, we give several efficient algorithms for learning in this framework.
One consequence of our model is that if the target function and data distribution are both well behaved
with respect to the compatibility notion, then the sample-size bounds we get can substantially beat what
one could hope to achieve using labeled data alone, and we have illustrated this with a number of examples
throughout the chapter.

2.6.1  Subsequent  Work 

Following the initial publication of this work, several authors have used our framework for reasoning
about semi-supervised learning, as well as for developing new algorithms and analyses of semi-supervised
learning. For example [114, 184, 189] use it in the context of agreement-based multi-view learning for
either classification with specific convex loss functions (e.g., hinge loss) or for regression. Sridharan and
Kakade [196] use our framework in order to provide a general analysis of multi-view learning for a variety
of loss functions and learning tasks (classification and regression) along with characterizations of suitable
notions of compatibility functions. Parts of this work appear as a book chapter in [82] and, as stated in the
introduction of that book, our framework can be used to obtain bounds for a number of the semi-supervised
learning methods used in the other chapters.

^For this simple concept class one can achieve a pure exponential improvement [86] in the realizable case, while in the
agnostic case the improvement depends upon the noise rate [30].

2.6.2  Discussion 

Our work brings up a number of open questions, both specific and high-level. One broad category of such
questions is for what natural classes $C$ and compatibility notions $\chi$ can one provide an efficient algorithm
that $PAC_{unl}$-learns the pair $(C, \chi)$: i.e., an algorithm whose running time and sample sizes are polynomial
in the bounds of Theorem 2.3.1? For example, a natural question of this form is: can one generalize the
algorithm of Section 2.4.1 to allow for irrelevant variables that are neither positive nor negative indicators?
That is, suppose we define a “two-sided disjunction” $h$ to be a pair of disjunctions $(h_+, h_-)$ where $h$ is
compatible with $D$ iff for all examples $x$, $h_+(x) = -h_-(x)$ (and let us define $h(x) = h_+(x)$). Can we
efficiently learn the class of two-sided disjunctions under this notion of compatibility?
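
To make the definition concrete, the compatibility condition for a two-sided disjunction can be checked on a finite sample as follows. This is a sketch with hypothetical names; it assumes Boolean feature vectors, monotone disjunctions with outputs in {-1, +1}, and it tests compatibility only on the given sample rather than on all of $D$.

```python
def disjunction(variables):
    """Monotone disjunction over the given variable indices, with outputs in {-1, +1}."""
    def h(x):
        return 1 if any(x[i] for i in variables) else -1
    return h

def is_compatible(h_plus, h_minus, sample):
    """Check h_plus(x) == -h_minus(x) on every example in the sample."""
    return all(h_plus(x) == -h_minus(x) for x in sample)

h_plus = disjunction({0, 1})    # positive indicators
h_minus = disjunction({3})      # negative indicators; variables 2 and 4 are irrelevant
sample = [(1, 0, 1, 0, 0), (0, 0, 0, 1, 1), (0, 1, 0, 0, 1)]
compatible = is_compatible(h_plus, h_minus, sample)
```

An example such as `(1, 0, 0, 1, 0)`, which contains both a positive and a negative indicator, would make the pair incompatible.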

Alternatively, as a different generalization of the problem analyzed in Section 2.4.1, suppose that again
every variable is either a positive or negative indicator, but we relax the “margin” condition. In particular,
suppose we require that every example $x$ either contain at least 60% of the positive indicators and at
most 40% of the negative indicators (for positive examples) or vice versa (for negative examples). Can
this class be learned efficiently with bounds comparable to those from Theorem 2.3.1? Along somewhat
different lines, can one generalize the algorithm given for Co-Training with linear separators, to assume
some condition weaker than independence given the label, while maintaining computational efficiency?




Chapter  3 


A  General  Theory  of  Learning  with 
Similarity  Functions 

3.1  Learning  with  Kernel  Functions.  Introduction 

Kernel functions have become an extremely popular tool in machine learning, with an attractive theory
as well [1, 133, 139, 187, 190, 203]. A kernel is a function that takes in two data objects (which could
be images, DNA sequences, or points in $\mathbb{R}^n$) and outputs a number, with the property that the function
is symmetric and positive-semidefinite. That is, for any kernel $K$, there must exist an (implicit) mapping
$\phi$, such that for all inputs $x, x'$ we have $K(x, x') = \langle \phi(x), \phi(x') \rangle$. The kernel is then used inside a
“kernelized” learning algorithm such as SVM or kernel-perceptron in place of direct access to the data.
Typical kernel functions for structured data include the polynomial kernel $K(x, x') = (1 + x \cdot x')^d$ and the
Gaussian kernel $K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2}$, and a number of special-purpose kernels have been developed
for sequence data, image data, and other types of data as well [88, 89, 157, 173, 193].
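
As a small illustration of the two kernels named above, and of what the implicit mapping $\phi$ looks like in one simple case, consider the following sketch. Function names and default parameters are illustrative; the explicit feature map is written out only for the degree-2 polynomial kernel on two-dimensional inputs.

```python
import math

def polynomial_kernel(x, y, degree=2):
    """K(x, y) = (1 + x . y)^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def explicit_map_degree2(x):
    """An explicit phi with phi(x) . phi(y) = (1 + x . y)^2 for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, y = [1.0, 2.0], [3.0, -1.0]
via_kernel = polynomial_kernel(x, y)
via_map = sum(a * b for a, b in zip(explicit_map_degree2(x), explicit_map_degree2(y)))
```

Both computations give the same number, but the kernel never materializes the six-dimensional feature vector.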

The theory behind kernel functions is based on the fact that many standard algorithms for learning
linear separators, such as SVMs [203] and the Perceptron [110] algorithm, can be written so that the only
way they interact with their data is via computing dot-products on pairs of examples. Thus, by replacing
each invocation of $\langle x, x' \rangle$ with a kernel computation $K(x, x')$, the algorithm behaves exactly as if we had
explicitly performed the mapping $\phi(x)$, even though $\phi$ may be a mapping into a very high-dimensional
space. Furthermore, these algorithms have learning guarantees that depend only on the margin of the best
separator, and not on the dimension of the space in which the data resides [18, 191]. Thus, kernel functions
are often viewed as providing much of the power of this implicit high-dimensional space, without paying
for it either computationally (because the $\phi$ mapping is only implicit) or in terms of sample size (if data is
indeed well-separated in that space).
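
A kernelized Perceptron makes the "only via dot-products" point concrete. The sketch below is a textbook-style version written for illustration, not the specific algorithm of [110]: the weight vector in $\phi$-space is never formed, only a mistake count per training example, and every interaction with the data goes through the supplied kernel function.

```python
def kernel_perceptron(examples, kernel, epochs=10):
    """Train a kernelized Perceptron.

    examples: list of (x, y) with y in {-1, +1}.
    kernel: function K(x, x') standing in for the dot product <phi(x), phi(x')>.
    Returns a predictor mapping x to +1 or -1.
    """
    alphas = [0] * len(examples)       # number of mistakes made on each example

    def score(x):
        return sum(a * y_i * kernel(x_i, x)
                   for a, (x_i, y_i) in zip(alphas, examples) if a)

    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(examples):
            if y * score(x) <= 0:      # mistake (or zero margin): add the example
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return lambda x: 1 if score(x) > 0 else -1

def poly2(x, y):
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** 2

xor_data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
predict = kernel_perceptron(xor_data, poly2, epochs=50)
```

The XOR data is not linearly separable in the input space, but it is separable in the implicit space of the degree-2 polynomial kernel, so the same algorithm fits it once the dot product is swapped for the kernel.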

While the above theory is quite elegant, it has a few limitations. When designing a kernel function
for some learning problem, the intuition employed typically does not involve implicit high-dimensional
spaces but rather that a good kernel would be one that serves as a good measure of similarity for the given
problem [187]. So, in this sense the theory is not always helpful in providing intuition when selecting or
designing a kernel function for a particular learning problem. Additionally, it may be that the most natural
similarity function for a given problem is not positive-semidefinite,^1 and it could require substantial work,
possibly reducing the quality of the function, to coerce it into a “legal” form. Finally, it is a bit unsatisfying
for the explanation of the effectiveness of some algorithm to depend on properties of an implicit
high-dimensional mapping that one may not even be able to calculate. In particular, the standard theory at
first blush has a “something for nothing” feel to it (all the power of the implicit high-dimensional space
without having to pay for it) and perhaps there is a more prosaic explanation of what it is that makes a
kernel useful for a given learning problem. For these reasons, it would be helpful to have a theory that
was in terms of more tangible quantities.

^1 This is very common in the context of Computational Biology where the most natural measures of alignment between
sequences are not legal kernels. For more examples see Section 3.2.

In this chapter, we develop a theory of learning with similarity functions that addresses a number of
these issues. In particular, we define a notion of what it means for a pairwise function $K(x, x')$ to be a
“good similarity function” for a given learning problem that (a) does not require the notion of an implicit
space and allows for functions that are not positive semi-definite, (b) we can show is sufficient to be used
for learning, and (c) strictly generalizes the standard theory in that a good kernel in the usual sense (large
margin in the implicit $\phi$-space) will also satisfy our definition of a good similarity function. In this way,
we provide the first theory that describes the effectiveness of a given kernel (or more general similarity
function) in terms of natural similarity-based properties.

More generally, our framework provides a formal way to analyze properties of a similarity function
that make it sufficient for learning, as well as what algorithms are suited for a given property. Note that
while our work is motivated by extending the standard large-margin notion of a good kernel function,
we expect one can use this framework to analyze other, not necessarily comparable, properties that are
sufficient for learning as well. In fact, recent work along these lines is given in [208].

Structure of this chapter: We start with background and notation in Section 3.2. We then present a first
notion of a good similarity function in Section 3.3 and analyze its relationship with the usual notion of a
good kernel function. (These results appear in [24] and [38].) In Section 3.4 we present a slightly different
and broader notion that we show provides an even better kernel-to-similarity translation; in Section 3.4.3 we
give a separation result, showing that this new notion is strictly more general than the notion of a large
margin kernel. (These results appear in [39].)


3.2  Background  and  Notation 

We consider a learning problem specified as follows. We are given access to labeled examples $(x, y)$
drawn from some distribution $P$ over $X \times \{-1, 1\}$, where $X$ is an abstract instance space. The
objective of a learning algorithm is to produce a classification function $g : X \to \{-1, 1\}$ whose error rate
$\Pr_{(x,y) \sim P}[g(x) \neq y]$ is low. We will consider learning algorithms that only access the points $x$ through a
pairwise similarity function $K(x, x')$ mapping pairs of points to numbers in the range $[-1, 1]$. Specifically,

Definition 3.2.1 A similarity function over $X$ is any pairwise function $K : X \times X \to [-1, 1]$. We say
that $K$ is a symmetric similarity function if $K(x, x') = K(x', x)$ for all $x, x'$.

A similarity function $K$ is a valid (or legal) kernel function if it is positive-semidefinite, i.e. there
exists a function $\phi$ from the instance space $X$ into some (implicit) Hilbert “$\phi$-space” such that

$K(x, x') = \langle \phi(x), \phi(x') \rangle$.

See, e.g., Smola and Schölkopf [186] for a discussion on conditions for a mapping being a kernel function.
Throughout this chapter, and without loss of generality, we will only consider kernels such that $K(x, x) \le 1$
for all $x \in X$. Any kernel $K$ can be converted into this form by, for instance, defining

$\tilde{K}(x, x') = K(x, x') / \sqrt{K(x, x) K(x', x')}$.
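The normalization above can be sketched as a small wrapper around any kernel. This is only an illustration: the helper names are hypothetical, and the sketch assumes $K(x, x) > 0$ for every point it is applied to.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def normalize_kernel(kernel: Kernel) -> Kernel:
    """Return K~(x, z) = K(x, z) / sqrt(K(x, x) * K(z, z)).

    Assumes K(x, x) > 0 for every point passed in (an assumption of this sketch).
    """
    def normalized(x: Vector, z: Vector) -> float:
        return kernel(x, z) / math.sqrt(kernel(x, x) * kernel(z, z))
    return normalized


def linear_kernel(x: Vector, z: Vector) -> float:
    return sum(a * b for a, b in zip(x, z))


k_tilde = normalize_kernel(linear_kernel)
self_similarity = k_tilde((3.0, 4.0), (3.0, 4.0))    # 25 / sqrt(25 * 25)
cross_similarity = k_tilde((3.0, 4.0), (4.0, 3.0))   # 24 / sqrt(25 * 25)
```

For the linear kernel the normalized value is the cosine of the angle between the two points, so it always lies in $[-1, 1]$.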




We say that $K$ is $(\epsilon, \gamma)$-kernel good for a given learning problem $P$ if there exists a vector $\beta$ in the $\phi$-space
that has error $\epsilon$ at margin $\gamma$; for simplicity we consider only separators through the origin. Specifically:²

Definition 3.2.2 $K$ is $(\epsilon, \gamma)$-kernel good if there exists a vector $\beta$, $\|\beta\| \le 1$ such that

$\Pr_{(x,y) \sim P}[y \langle \phi(x), \beta \rangle \ge \gamma] \ge 1 - \epsilon$.


We say that $K$ is $\gamma$-kernel good if it is $(\epsilon, \gamma)$-kernel good for $\epsilon = 0$; i.e., it has zero error at margin $\gamma$.

Given a kernel that is $(\epsilon, \gamma)$-kernel-good for some learning problem $P$, a predictor with error rate at
most $\epsilon + \epsilon_{acc}$ can be learned (with high probability) from a sample of³ $\tilde{O}((\epsilon + \epsilon_{acc}) / (\gamma^2 \epsilon_{acc}^2))$ examples
(drawn independently from the source distribution) by minimizing the number of margin $\gamma$ violations
on the sample [168]. However, minimizing the number of margin violations on the sample is a difficult
optimization problem [18, 20]. Instead, it is common to minimize the so-called hinge loss relative to a
margin.

Definition 3.2.3 We say that $K$ is $(\epsilon, \gamma)$-kernel good in hinge-loss if there exists a vector $\beta$, $\|\beta\| \le 1$ such
that

$E_{(x,y) \sim P}[[1 - y \langle \beta, \phi(x) \rangle / \gamma]_+] \le \epsilon$,
where $[1 - z]_+ = \max(1 - z, 0)$ is the hinge loss.

Given a kernel that is $(\epsilon, \gamma)$-kernel-good in hinge-loss, a predictor with error rate at most $\epsilon + \epsilon_{acc}$ can
be efficiently learned (with high probability) from a sample of $\tilde{O}(1 / (\gamma^2 \epsilon_{acc}^2))$ examples by minimizing
the average hinge loss relative to margin $\gamma$ on the sample [43].
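To make the two empirical quantities concrete, here is a small sketch that computes the fraction of margin-$\gamma$ violations (Definition 3.2.2) and the average hinge loss relative to margin $\gamma$ (Definition 3.2.3) on a sample. It assumes an explicit feature vector $\phi(x)$ is available, which is a simplification for illustration; the function names and the four-point sample are made up.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Sample = List[Tuple[Vector, int]]  # pairs (phi(x), y) with y in {-1, +1}


def signed_margin(beta: Vector, phi_x: Vector, y: int) -> float:
    """y * <beta, phi(x)>."""
    return y * sum(b * p for b, p in zip(beta, phi_x))


def margin_violation_rate(sample: Sample, beta: Vector, gamma: float) -> float:
    """Fraction of examples with y * <beta, phi(x)> < gamma."""
    return sum(1 for p, y in sample if signed_margin(beta, p, y) < gamma) / len(sample)


def average_hinge_loss(sample: Sample, beta: Vector, gamma: float) -> float:
    """Average of [1 - y * <beta, phi(x)> / gamma]_+ over the sample."""
    return sum(max(0.0, 1.0 - signed_margin(beta, p, y) / gamma)
               for p, y in sample) / len(sample)


beta = (1.0, 0.0)  # unit-norm separator through the origin
sample = [((1.0, 0.0), 1), ((0.25, 0.5), 1), ((-0.5, 0.5), -1), ((0.5, 0.0), -1)]
violations = margin_violation_rate(sample, beta, gamma=0.5)
hinge = average_hinge_loss(sample, beta, gamma=0.5)
```

On any sample the average hinge loss is at least the violation rate, since each violating example contributes more than 0 and each example with negative margin contributes more than 1; minimizing it is a convex problem, unlike counting violations.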

We  end  this  section  by  noting  that  a  general  similarity  function  might  not  be  a  legal  (valid)  kernel.  To 
illustrate  this  we  provide  a  few  examples  in  the  following. 

Examples of similarity functions which are not legal kernel functions. As a simple example, let
us consider a document classification task and let us assume we have a similarity function $K$ such that
two documents have similarity 1 if they have either an author in common or a keyword in common, and
similarity 0 otherwise. Then we could have three documents $A$, $B$, and $C$, such that $K(A, B) = 1$ because
$A$ and $B$ have an author in common, $K(B, C) = 1$ because $B$ and $C$ have a keyword in common, but
$K(A, C) = 0$ because $A$ and $C$ have neither an author nor a keyword in common (and $K(A, A) =
K(B, B) = K(C, C) = 1$). On the other hand, a kernel requires that if $\phi(A)$ and $\phi(B)$ are of unit length
and $\langle \phi(A), \phi(B) \rangle = 1$, then $\phi(A) = \phi(B)$, which would force $K(A, C) = K(B, C)$, so this could not
happen if $K$ was a valid kernel.
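The three-document example can also be checked directly: a positive-semidefinite matrix must satisfy $v^T K v \ge 0$ for every vector $v$, so exhibiting one vector with a negative value is enough to show the similarity matrix is not a legal kernel. The witness vector below is chosen for this illustration and is not part of the original argument.

```python
import math
from typing import List

# Similarity matrix over documents A, B, C from the example:
# K(A,B) = K(B,C) = 1, K(A,C) = 0, and ones on the diagonal.
K = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]


def quadratic_form(matrix: List[List[float]], v: List[float]) -> float:
    """Compute v^T * matrix * v."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


# Witness vector (an assumption of this sketch): v = (1, -sqrt(2), 1).
v = [1.0, -math.sqrt(2.0), 1.0]
value = quadratic_form(K, v)  # equals 4 - 4*sqrt(2), which is negative
```

Because the quadratic form is negative for this `v`, no mapping $\phi$ can reproduce these similarities as inner products.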

Similarity functions that are not legal kernels are common in the context of computational
biology [160]; standard examples include various measures of alignment between sequences such as BLAST
scores for protein sequences or for DNA. Finally, one other natural example of a similarity function that
might not be a legal kernel (and which might not be even symmetric) is the following: consider a
transductive setting (where we have all the points we want to classify in advance) and assume we have a
base distance function $d(x, x')$. Let us define $K(x, x')$ as the percentile rank of $x'$ in distance to $x$ (i.e.,
$K(x, x') = \Pr_{x'' \sim P}[d(x, x') \le d(x, x'')]$); then clearly $K$ might not be a legal kernel since in fact it might not
even be a symmetric similarity function.
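A minimal sketch of the percentile-rank similarity follows, assuming the transductive point set is finite and $x''$ is drawn uniformly from it (both assumptions of this illustration, as are the function names and the four points on a line).

```python
from typing import Callable, Sequence


def percentile_rank_similarity(
    points: Sequence[float], dist: Callable[[float, float], float]
) -> Callable[[float, float], float]:
    """K(x, x') = fraction of points x'' with dist(x, x') <= dist(x, x'')."""
    def K(x: float, x_prime: float) -> float:
        d = dist(x, x_prime)
        return sum(1 for z in points if d <= dist(x, z)) / len(points)
    return K


points = [0.0, 1.0, 3.0, 10.0]
K = percentile_rank_similarity(points, lambda a, b: abs(a - b))

forward = K(3.0, 10.0)    # 10 is the farthest point from 3
backward = K(10.0, 3.0)   # 3 is the nearest other point to 10
```

Here `forward` and `backward` differ, so this $K$ is not symmetric and therefore cannot be a legal kernel, even though it is a perfectly reasonable notion of similarity.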

Of course, one could modify such a function to be positive semidefinite, e.g., by blowing up the
diagonal or by using other related methods suggested in the literature [166], but none of these methods
have a formal guarantee on the final generalization bound (and these methods might significantly decrease
the “dynamic range” of $K$ and yield a very small margin).

² Note that we are distinguishing between what is needed for a similarity function to be a valid or legal kernel function
(symmetric and positive semidefinite) and what is needed to be a good kernel function for a learning problem (large margin).

³ The $\tilde{O}(\cdot)$ notation hides logarithmic factors in the arguments, and in the failure probability.




3.3  Learning  with  More  General  Similarity  Functions:  A  First  Attempt 


Our goal is to describe “goodness” properties that are sufficient for a similarity function to allow one to
learn well, and that ideally are intuitive and subsume the usual notion of good kernel function. Note that as
with the theory of kernel functions [186], “goodness” is with respect to a given learning problem $P$, and
not with respect to a class of target functions as in the PAC framework [149, 201].

We start by presenting here the notion of good similarity functions introduced in [24] and further
analyzed in [195] and [38], which throughout the chapter we call the Balcan - Blum’06 definition. We begin
with a definition (Definition 3.3.1) that is especially intuitive and allows for learning via a very simple
algorithm, but is not broad enough to include all kernel functions that induce large-margin separators. We
then broaden this notion to the main definition in [24] (Definition 3.3.5) that requires a more involved
algorithm to learn, but is now able to capture all functions satisfying the usual notion of a good kernel
function. Specifically, we show that if $K$ is a similarity function satisfying Definition 3.3.5 then one
can algorithmically perform a simple, explicit transformation of the data under which there is a low-error
large-margin separator. We also consider variations on this definition (e.g., Definition 3.3.6) that produce
better guarantees on the quality of the final hypothesis when combined with existing learning algorithms.

A similarity function $K$ satisfying the Balcan - Blum’06 definition, but that is not positive
semi-definite, is not necessarily guaranteed to work well when used directly in standard learning algorithms
such as SVM or the Perceptron algorithm.⁴ Instead, what we show is that such a similarity function
can be employed in the following two-stage algorithm. First, re-represent the data by performing what
might be called an “empirical similarity map”: selecting a subset of data points as landmarks, and then
representing each data point using the similarities to those landmarks. Then, use standard methods to find
a large-margin linear separator in the new space. One property of this approach is that it allows for the use
of a broader class of learning algorithms since one does not need the algorithm used in the second step to
be “kernelizable”. In fact, the work in this chapter is motivated by work on a re-representation method that
algorithmically transforms a kernel-based learning problem (with a valid positive-semidefinite kernel) to
an explicit low-dimensional learning problem [31]. (We present this in Chapter 6.)

Deterministic Labels: For simplicity in presentation, for most of this section we will consider only
learning problems where the label $y$ is a deterministic function of $x$. For such learning problems, we can
use $y(x)$ to denote the label of point $x$, and we will use $x \sim P$ as shorthand for $(x, y(x)) \sim P$. We will
return to learning problems where the label $y$ may be a probabilistic function of $x$ in Section 3.3.5.

3.3.1  Sufficient  Conditions  for  Learning  with  Similarity  Functions 

We now provide a series of sufficient conditions for a similarity function to be useful for learning, leading
to the notions given in Definitions 3.3.5 and 3.3.6.

3.3.2  Simple  Sufficient  Conditions 

We begin with our first and simplest notion of “good similarity function” that is intuitive and yields
an immediate learning algorithm, but which is not broad enough to capture all good kernel functions.
Nonetheless, it provides a convenient starting point. This definition says that $K$ is a good similarity
function for a learning problem $P$ if most examples $x$ (at least a $1 - \epsilon$ probability mass) are on average at
least $\gamma$ more similar to random examples $x'$ of the same label than they are to random examples $x'$ of the
opposite label. Formally,

⁴ However, as we will see in Section 3.3.5, if the function is positive semi-definite and if it is good in the Balcan -
Blum’06 sense [24, 38], or in the Balcan - Blum - Srebro’08 sense [39], then we can show it is good as a kernel as well.




Definition 3.3.1 $K$ is a strongly $(\epsilon, \gamma)$-good similarity function for a learning problem $P$ if at least a
$1 - \epsilon$ probability mass of examples $x$ satisfy:

$E_{x' \sim P}[K(x, x') \mid y(x) = y(x')] \ge E_{x' \sim P}[K(x, x') \mid y(x) \neq y(x')] + \gamma$. (3.1)

For example, suppose all positive examples have similarity at least 0.2 with each other, and all negative
examples have similarity at least 0.2 with each other, but positive and negative examples have similarities
distributed uniformly at random in $[-1, 1]$. Then, this would satisfy Definition 3.3.1 for $\gamma = 0.2$ and
$\epsilon = 0$. Note that with high probability this would not be positive semidefinite.⁵

Definition 3.3.1 captures an intuitive notion of what one might want in a similarity function. In
addition, if a similarity function $K$ satisfies Definition 3.3.1 then it suggests a simple, natural learning
algorithm: draw a sufficiently large set $S^+$ of positive examples and set $S^-$ of negative examples, and
then output the prediction rule that classifies a new example $x$ as positive if it is on average more similar
to points in $S^+$ than to points in $S^-$, and negative otherwise. Formally:

Theorem 3.3.1 If $K$ is strongly $(\epsilon, \gamma)$-good, then a set $S^+$ of $(16/\gamma^2) \ln(2/\delta)$ positive examples and a
set $S^-$ of $(16/\gamma^2) \ln(2/\delta)$ negative examples are sufficient so that with probability $\ge 1 - \delta$, the above
algorithm produces a classifier with error at most $\epsilon + \delta$.

Proof: Let Good be the set of $x$ satisfying

$E_{x' \sim P}[K(x, x') \mid y(x) = y(x')] \ge E_{x' \sim P}[K(x, x') \mid y(x) \neq y(x')] + \gamma$.

So, by assumption, $\Pr_{x \sim P}[x \in \text{Good}] \ge 1 - \epsilon$. Now, fix $x \in \text{Good}$. Since $K(x, x') \in [-1, 1]$, by
Hoeffding bounds we have that over the random draw of the sample $S^+$,

$\Pr(|E_{x' \in S^+}[K(x, x')] - E_{x' \sim P}[K(x, x') \mid y(x') = 1]| \ge \gamma/2) \le 2e^{-2|S^+|\gamma^2/16}$,

and similarly for $S^-$. By our choice of $|S^+|$ and $|S^-|$, each of these probabilities is at most $\delta^2/2$.

So, for any given $x \in \text{Good}$, there is at most a $\delta^2$ probability of error over the draw of $S^+$ and $S^-$.
Since this is true for any $x \in \text{Good}$, it implies that the expected error of this procedure, over $x \in \text{Good}$,
is at most $\delta^2$, which by Markov’s inequality implies that there is at most a $\delta$ probability that the error rate
over Good is more than $\delta$. Adding in the $\epsilon$ probability mass of points not in Good yields the theorem. ■
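The algorithm analyzed in Theorem 3.3.1 can be sketched in a few lines. The toy problem below (points in $[-1, 1]$ with magnitude at least 0.2, label given by the sign, and $K(x, x') = x x'$) is a hypothetical example chosen because it satisfies Definition 3.3.1 with $\epsilon = 0$ and $\gamma = 0.24$; all function names are assumptions of this sketch.

```python
import math
import random
from typing import Callable, List


def sample_size(gamma: float, delta: float) -> int:
    """Number of examples per class from Theorem 3.3.1: (16 / gamma^2) * ln(2 / delta)."""
    return math.ceil((16.0 / gamma ** 2) * math.log(2.0 / delta))


def average_similarity_classifier(
    K: Callable[[float, float], float], s_pos: List[float], s_neg: List[float]
) -> Callable[[float], int]:
    """Label x positive iff it is on average more similar to s_pos than to s_neg."""
    def classify(x: float) -> int:
        avg_pos = sum(K(x, z) for z in s_pos) / len(s_pos)
        avg_neg = sum(K(x, z) for z in s_neg) / len(s_neg)
        return 1 if avg_pos > avg_neg else -1
    return classify


# Hypothetical toy problem: label = sign(x), similarity = product.
rng = random.Random(0)


def draw_positive() -> float:
    return rng.uniform(0.2, 1.0)


def draw_negative() -> float:
    return -rng.uniform(0.2, 1.0)


def K(x: float, z: float) -> float:
    return x * z


n = sample_size(gamma=0.24, delta=0.05)
classify = average_similarity_classifier(
    K, [draw_positive() for _ in range(n)], [draw_negative() for _ in range(n)]
)
test_points = [draw_positive() for _ in range(100)] + [draw_negative() for _ in range(100)]
errors = sum(1 for x in test_points if classify(x) != (1 if x > 0 else -1))
```

In this toy problem every point has a gap of at least 0.24 between its same-label and opposite-label average similarity, so the rule makes no mistakes; on real data the guarantee is the $\epsilon + \delta$ bound of the theorem.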


Before going to our main notion, note that Definition 3.3.1 requires that almost all of the points (at
least a $1 - \epsilon$ fraction) be on average more similar to random points of the same label than to random points
of the other label. A weaker notion would be simply to require that two random points of the same label
be on average more similar than two random points of different labels. For instance, one could consider
the following generalization of Definition 3.3.1:

Definition 3.3.2 $K$ is a weakly $\gamma$-good similarity function for a learning problem $P$ if:

$E_{x, x' \sim P}[K(x, x') \mid y(x) = y(x')] \ge E_{x, x' \sim P}[K(x, x') \mid y(x) \neq y(x')] + \gamma$. (3.2)

While Definition 3.3.2 still captures a natural intuitive notion of what one might want in a similarity
function, it is not powerful enough to imply strong learning unless $\gamma$ is quite large. For example, suppose
the instance space is $\mathbb{R}^2$ and that the similarity measure $K$ we are considering is just the product of the first
coordinates (i.e., dot-product but ignoring the second coordinate). Assume the distribution is half positive
and half negative, and that 75% of the positive examples are at position $(1, 1)$ and 25% are at position
$(-1, 1)$, and 75% of the negative examples are at position $(-1, -1)$ and 25% are at position $(1, -1)$.
Then $K$ is a weakly $\gamma$-good similarity function for $\gamma = 1/2$, but the best accuracy one can hope for using
$K$ is 75% because that is the accuracy of the Bayes-optimal predictor given only the first coordinate.

⁵ In particular, if the domain is large enough, then with high probability there would exist a negative example $A$ and positive
examples $B$, $C$ such that $K(A, B)$ is close to 1 (so they are nearly identical as vectors), $K(A, C)$ is close to $-1$ (so they are
nearly opposite as vectors), and yet $K(B, C) \ge 0.2$ (their vectors form an acute angle).
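The numbers in this example can be verified exactly. The sketch below enumerates the four-point distribution described above, computes both sides of (3.2), and computes the accuracy of the best predictor that sees only the first coordinate; the helper names are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import product

# (point, label, probability) for the four-point distribution in the example.
dist = [
    ((1, 1), 1, Fraction(3, 8)),
    ((-1, 1), 1, Fraction(1, 8)),
    ((-1, -1), -1, Fraction(3, 8)),
    ((1, -1), -1, Fraction(1, 8)),
]


def K(x, z):
    """Product of first coordinates; the second coordinate is ignored."""
    return x[0] * z[0]


def conditional_mean(same_label: bool) -> Fraction:
    """E[K(x, x') | y(x) = y(x')] or E[K(x, x') | y(x) != y(x')] for independent x, x'."""
    num = den = Fraction(0)
    for (x, yx, px), (z, yz, pz) in product(dist, repeat=2):
        if (yx == yz) == same_label:
            num += px * pz * K(x, z)
            den += px * pz
    return num / den


def bayes_accuracy_first_coordinate() -> Fraction:
    """Accuracy of the best predictor that only sees the first coordinate."""
    total = Fraction(0)
    for v in (-1, 1):
        mass = {1: Fraction(0), -1: Fraction(0)}
        for x, y, p in dist:
            if x[0] == v:
                mass[y] += p
        total += max(mass.values())
    return total


gap = conditional_mean(True) - conditional_mean(False)
bayes = bayes_accuracy_first_coordinate()
```

The same-label average is $1/4$ and the different-label average is $-1/4$, giving the gap of $1/2$, while no rule based on the first coordinate alone exceeds 75% accuracy.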

We can however show that for any $\gamma > 0$, Definition 3.3.2 is enough to imply weak learning [188]. In
particular, the following simple algorithm is sufficient to weak learn. First, determine if the distribution is
noticeably skewed towards positive or negative examples: if so, weak-learning is immediate (output
all-positive or all-negative respectively). Otherwise, draw a sufficiently large set $S^+$ of positive examples and
set $S^-$ of negative examples. Then, for each $x$, consider $\tilde{\gamma}(x) = \frac{1}{2}[E_{x' \in S^+}[K(x, x')] - E_{x' \in S^-}[K(x, x')]]$.
Finally, to classify $x$, use the following probabilistic prediction rule: classify $x$ as positive with probability
$\frac{1 + \tilde{\gamma}(x)}{2}$ and as negative with probability $\frac{1 - \tilde{\gamma}(x)}{2}$. (Notice that $\tilde{\gamma}(x) \in [-1, 1]$ and so our algorithm is well
defined.) We can then prove the following result:

Theorem 3.3.2 If $K$ is a weakly $\gamma$-good similarity function, then with probability at least $1 - \delta$, the above
algorithm using sets $S^+$, $S^-$ of size $\frac{64}{\gamma^2} \ln(\frac{64}{\gamma\delta})$ yields a classifier with error at most $\frac{1}{2} - \frac{3\gamma}{128}$.

Proof: First, we assume the algorithm initially draws a sufficiently large sample such that if the
distribution is skewed with probability mass greater than $\frac{1}{2} + \alpha$ on positives or negatives for $\alpha = \frac{\gamma}{32}$, then
with probability at least $1 - \delta/2$ the algorithm notices the bias and weak-learns immediately (and if the
distribution is less skewed than $\frac{1}{2} \pm \frac{3\gamma}{128}$, with probability $1 - \delta/2$ it does not incorrectly halt in this step).
In the following, then, we may assume the distribution $P$ is less than $(\frac{1}{2} + \alpha)$-skewed, and let us define
$P'$ to be $P$ reweighted to have probability mass exactly 1/2 on positive and negative examples. Thus,
Definition 3.3.2 is satisfied for $P'$ with margin at least $\gamma - 4\alpha$.

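For each $x$ define $\gamma(x)$ as $\frac{1}{2} E_{x'}[K(x, x') \mid y(x') = 1] - \frac{1}{2} E_{x'}[K(x, x') \mid y(x') = -1]$ and notice that
Definition 3.3.2 implies that $E_{x \sim P'}[y(x)\gamma(x)] \ge \gamma/2 - 2\alpha$. Consider now the probabilistic prediction
function $g$ defined as $g(x) = 1$ with probability $\frac{1 + \gamma(x)}{2}$ and $g(x) = -1$ with probability $\frac{1 - \gamma(x)}{2}$. We
clearly have that for a fixed $x$,

$\Pr_g(g(x) \neq y(x)) = \frac{1 - y(x)\gamma(x)}{2}$,

which then implies that $\Pr_{x \sim P', g}(g(x) \neq y(x)) \le \frac{1}{2} - (\frac{1}{4}\gamma - \alpha)$. Now notice that in our algorithm we
do not use $\gamma(x)$ but an estimate of it $\tilde{\gamma}(x)$, and so the last step of the proof is to argue that this is good
enough. To see this, notice first that $d$ is large enough so that for any fixed $x$ we have

$\Pr_{S^+, S^-}(|\gamma(x) - \tilde{\gamma}(x)| \ge \frac{\gamma}{4} - 2\alpha) \le \frac{\gamma\delta}{32}$.

This implies

$\Pr_{x \sim P', S^+, S^-}(|\gamma(x) - \tilde{\gamma}(x)| \ge \frac{\gamma}{4} - 2\alpha) \le \frac{\gamma\delta}{32}$,

so

$\Pr_{S^+, S^-}(\Pr_{x \sim P'}(|\gamma(x) - \tilde{\gamma}(x)| \ge \frac{\gamma}{4} - 2\alpha) \ge \frac{\gamma}{16}) \le \delta/2$.

This further implies that with probability at least $1 - \delta/2$ we have
$E_{x \sim P'}[y(x)\tilde{\gamma}(x)] \ge (1 - \frac{\gamma}{16})\frac{\gamma}{4} - 2 \cdot \frac{\gamma}{16} \ge \frac{7\gamma}{64}$.
Finally using a reasoning similar to the one above (concerning the probabilistic prediction function
based on $\gamma(x)$), we obtain that with probability at least $1 - \delta/2$ the error of the probabilistic classifier
based on $\tilde{\gamma}(x)$ is at most $\frac{1}{2} - \frac{7\gamma}{128}$ on $P'$, which implies the error over $P$ is at most $\frac{1}{2} - \frac{7\gamma}{128} + \alpha = \frac{1}{2} - \frac{3\gamma}{128}$.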
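The second stage of the weak learner (the probabilistic prediction rule based on the empirical estimate of $\gamma(x)$) can be sketched as follows. The sketch omits the initial skew test, uses the four-point example distribution with product-of-first-coordinates similarity, and takes $S^+$ and $S^-$ to mirror the class-conditional distributions exactly; these choices and all names are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Point = Tuple[int, int]


def K(x: Point, z: Point) -> float:
    """Product of first coordinates."""
    return float(x[0] * z[0])


def gamma_tilde(x: Point, s_pos: List[Point], s_neg: List[Point]) -> float:
    """Half the difference between average similarity to S+ and to S-."""
    avg_pos = sum(K(x, z) for z in s_pos) / len(s_pos)
    avg_neg = sum(K(x, z) for z in s_neg) / len(s_neg)
    return 0.5 * (avg_pos - avg_neg)


def prob_positive(x: Point, s_pos: List[Point], s_neg: List[Point]) -> float:
    """Probability that the randomized rule labels x positive."""
    return (1.0 + gamma_tilde(x, s_pos, s_neg)) / 2.0


def classify(x: Point, s_pos: List[Point], s_neg: List[Point], rng: random.Random) -> int:
    return 1 if rng.random() < prob_positive(x, s_pos, s_neg) else -1


# Samples that mirror the class-conditional distributions of the example.
s_pos = [(1, 1)] * 3 + [(-1, 1)]
s_neg = [(-1, -1)] * 3 + [(1, -1)]
dist = [((1, 1), 1, 0.375), ((-1, 1), 1, 0.125), ((-1, -1), -1, 0.375), ((1, -1), -1, 0.125)]

# Exact expected error of the randomized rule under the distribution.
expected_error = sum(
    p * ((1.0 - prob_positive(x, s_pos, s_neg)) if y == 1 else prob_positive(x, s_pos, s_neg))
    for x, y, p in dist
)
sampled_label = classify((1, 1), s_pos, s_neg, random.Random(0))
```

The expected error here is strictly below 1/2, which is all weak learning asks for, even though the 75% Bayes limit of the example rules out strong learning.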




Figure 3.1: Positives are split equally among upper-left and upper-right. Negatives are all in the
lower-right. For $\alpha = 30^\circ$ (so $\gamma = 1/2$) a large fraction of the positive examples (namely the 50% in the
upper-right) have a higher dot-product with negative examples ($\frac{1}{2}$) than with a random positive example
($\frac{1}{2} \cdot 1 + \frac{1}{2}(-\frac{1}{2}) = \frac{1}{4}$). However, if we assign the positives in the upper-left a weight of 0, those in the
upper-right a weight of 1, and assign negatives a weight of $\frac{1}{2}$, then all examples have higher average
weighted similarity to those of the same label than to those of the opposite label, by a gap of $\frac{1}{4}$.

Returning to Definition 3.3.1, Theorem 3.3.1 implies that if $K$ is a strongly $(\epsilon, \gamma)$-good similarity
function for small $\epsilon$ and not-too-small $\gamma$, then it can be used in a natural way for learning. However,
Definition 3.3.1 is not sufficient to capture all good kernel functions. In particular, Figure 3.1 gives a
simple example in $\mathbb{R}^2$ where the standard kernel $K(x, x') = \langle x, x' \rangle$ has a large margin separator (margin
of 1/2) and yet does not satisfy Definition 3.3.1, even for $\gamma = 0$ and $\epsilon = 0.24$.

Notice, however, that if in Figure 3.1 we simply ignored the positive examples in the upper-left when
choosing $x'$, and down-weighted the negative examples a bit, then we would be fine. This then motivates
the following intermediate notion of a similarity function $K$ being good under a weighting function $w$
over the input space that can downweight certain portions of that space.

Definition 3.3.3 A similarity function $K$ together with a bounded weighting function $w$ over $X$
(specifically, $w(x') \in [0, 1]$ for all $x' \in X$) is a strongly $(\epsilon, \gamma)$-good weighted similarity function for a learning
problem $P$ if at least a $1 - \epsilon$ probability mass of examples $x$ satisfy:

$E_{x' \sim P}[w(x') K(x, x') \mid y(x) = y(x')] \ge E_{x' \sim P}[w(x') K(x, x') \mid y(x) \neq y(x')] + \gamma$. (3.3)

We can view Definition 3.3.3 intuitively as saying that we only require most examples be substantially
more similar on average to representative points of the same class than to representative points of the
opposite class, where “representativeness” is a score in $[0, 1]$ given by the weighting function $w$. A pair
$(K, w)$ satisfying Definition 3.3.3 can be used in exactly the same way as a similarity function $K$ satisfying
Definition 3.3.1, with the exact same proof used in Theorem 3.3.1 (except now we view $w(x') K(x, x')$ as
the bounded random variable we plug into Hoeffding bounds).
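The Figure 3.1 example can be checked numerically against Definitions 3.3.1 and 3.3.3. The coordinates below are an assumption of this sketch: unit vectors at $30^\circ$ (upper-right positives), $150^\circ$ (upper-left positives) and $-30^\circ$ (negatives), which reproduces the dot-products quoted in the caption. The weights 0, 1 and 1/2 are those given in the caption.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]
Group = List[Tuple[Vec, float, float]]  # (point, probability within class, weight)


def unit(degrees: float) -> Vec:
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1]


UR, UL, NEG = unit(30), unit(150), unit(-30)
positives: Group = [(UR, 0.5, 1.0), (UL, 0.5, 0.0)]
negatives: Group = [(NEG, 1.0, 0.5)]


def class_average(x: Vec, group: Group, weighted: bool) -> float:
    """E[w(x') K(x, x') | x' in group], with w = 1 everywhere when unweighted."""
    return sum(p * (w if weighted else 1.0) * dot(x, z) for z, p, w in group)


def gap(x: Vec, same: Group, other: Group, weighted: bool) -> float:
    return class_average(x, same, weighted) - class_average(x, other, weighted)


unweighted_gaps = {
    "upper-right": gap(UR, positives, negatives, False),
    "upper-left": gap(UL, positives, negatives, False),
    "negative": gap(NEG, negatives, positives, False),
}
weighted_gaps = {
    "upper-right": gap(UR, positives, negatives, True),
    "upper-left": gap(UL, positives, negatives, True),
    "negative": gap(NEG, negatives, positives, True),
}
```

Without weights the upper-right positives have a negative gap, so Definition 3.3.1 fails for them; with the caption's weights every point has a gap of 1/4, so the pair $(K, w)$ satisfies Definition 3.3.3 with $\epsilon = 0$ and $\gamma = 1/4$.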

3.3.3  Main  Balcan  -  Blum’06  Conditions 

Unfortunately,  Definition  3.3.3  requires  the  designer  to  construct  both  K  and  w,  rather  than  just  K.  We 
now  weaken  the  requirement  to  ask  only  that  such  a  w  exist,  in  Definition  3.3.4  below: 

Definition 3.3.4 (Main Balcan - Blum’06 Definition, Balanced Version) A similarity function $K$ is an
$(\epsilon, \gamma)$-good similarity function for a learning problem $P$ if there exists a bounded weighting function $w$
over $X$ ($w(x') \in [0, 1]$ for all $x' \in X$) such that at least a $1 - \epsilon$ probability mass of examples $x$ satisfy:

$E_{x' \sim P}[w(x') K(x, x') \mid y(x) = y(x')] \ge E_{x' \sim P}[w(x') K(x, x') \mid y(x) \neq y(x')] + \gamma$. (3.4)




As  mentioned  above,  the  key  difference  is  that  whereas  in  Definition  3.3.3  one  needs  the  designer 
to  construct  both  the  similarity  function  K  and  the  weighting  function  w,  in  Definition  3.3.4  we  only 
require  that  such  a  w  exist,  but  it  need  not  be  known  a-priori.  That  is,  we  ask  only  that  there  exist  a 
large  probability  mass  of  “representative”  points  (a  weighting  scheme)  satisfying  Definition  3.3.3,  but  the 
designer  need  not  know  in  advance  what  that  weighting  scheme  should  be. 

Definition 3.3.4 can also be stated as requiring that, for at least $1 - \epsilon$ of the examples, the classification
margin

$E_{x' \sim P}[w(x') K(x, x') \mid y(x) = y(x')] - E_{x' \sim P}[w(x') K(x, x') \mid y(x) \neq y(x')]$

$= y(x) E_{x' \sim P}[w(x') y(x') K(x, x') / P(y(x'))]$ (3.5)

be at least $\gamma$, where $P(y(x'))$ is the marginal probability under $P$, i.e. the prior, of the label associated
with $x'$. We will find it more convenient in the following to analyze instead a slight variant, dropping the
factor $1/P(y(x'))$ from the classification margin (3.5); see Definition 3.3.5 in the next Section. Any
similarity function satisfying Definition 3.3.5 also satisfies Definition 3.3.4 (by simply multiplying $w(x')$ by
$P(y(x'))$). However, the learning algorithm using Definition 3.3.5 is slightly simpler, and the connection
to kernels is a bit more direct.
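The equality between the difference of conditional expectations and the single expectation in the classification margin can be seen by writing each conditional expectation with an indicator. The short derivation below is a sketch added for clarity; the indicator notation $\mathbf{1}\{\cdot\}$ is an assumption of this sketch and is not used elsewhere in the text.

```latex
\begin{align*}
E_{x' \sim P}\big[w(x') K(x,x') \mid y(x') = y(x)\big]
  &= E_{x' \sim P}\!\left[\frac{w(x') K(x,x')\,\mathbf{1}\{y(x') = y(x)\}}{P(y(x'))}\right], \\
E_{x' \sim P}\big[w(x') K(x,x') \mid y(x') \neq y(x)\big]
  &= E_{x' \sim P}\!\left[\frac{w(x') K(x,x')\,\mathbf{1}\{y(x') \neq y(x)\}}{P(y(x'))}\right].
\end{align*}
% In each line the conditioning event has probability P(y(x')) for every x' in the event.
% Since 1{y(x') = y(x)} - 1{y(x') != y(x)} = y(x) y(x'), subtracting gives
\begin{align*}
E_{x' \sim P}\!\left[\frac{y(x)\, y(x')\, w(x') K(x,x')}{P(y(x'))}\right]
  = y(x)\, E_{x' \sim P}\!\left[\frac{w(x')\, y(x')\, K(x,x')}{P(y(x'))}\right].
\end{align*}
```

Replacing $w(x')$ by $w(x') P(y(x'))$ removes the denominator, which is exactly the substitution that relates the balanced version to the variant without label priors.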

We are now ready to present the main sufficient condition for learning with similarity functions in [24].
This is essentially a restatement of Definition 3.3.4, dropping the normalization by the label “priors” as
discussed at the end of the preceding Section.

Definition 3.3.5 (Main Balcan - Blum’06 Definition, Margin Violations) A similarity function $K$ is an
$(\epsilon, \gamma)$-good similarity function for a learning problem $P$ if there exists a bounded weighting function $w$
over $X$ ($w(x') \in [0, 1]$ for all $x' \in X$) such that at least a $1 - \epsilon$ probability mass of examples $x$ satisfy:

$E_{x' \sim P}[y(x) y(x') w(x') K(x, x')] \ge \gamma$. (3.6)

We would like to establish that the above condition is indeed sufficient for learning, i.e., that given an
$(\epsilon, \gamma)$-good similarity function $K$ for some learning problem $P$, and a sufficiently large labeled sample
drawn from $P$, one can obtain (with high probability) a predictor with error rate arbitrarily close to $\epsilon$. To
do so, we will show how to use an $(\epsilon, \gamma)$-good similarity function $K$, and a sample $S$ drawn from $P$, in
order to construct (with high probability) an explicit mapping $\phi^S : X \to \mathbb{R}^d$ for all points in $X$ (not only
points in the sample $S$), such that the mapped data $(\phi^S(x), y(x))$, where $x \sim P$, is separated with error
close to $\epsilon$ (and in fact also with large margin) in the low-dimensional linear space $\mathbb{R}^d$ (Theorem 3.3.3
below). We thereby convert the learning problem into a standard problem of learning a linear separator,
and can use standard results on learnability of linear separators to establish learnability of our original
learning problem, and even provide learning guarantees.

What we are doing is actually showing how to use a good similarity function $K$ (that is not necessarily
a valid kernel) and a sample $S$ drawn from $P$ to construct a valid kernel $\tilde{K}_S$, given by $\tilde{K}_S(x, x') =
\langle \phi^S(x), \phi^S(x') \rangle$, that is kernel-good and can thus be used for learning (in Section 3.3.5 we show that if
$K$ is already a valid kernel, a transformation is not necessary as $K$ itself is kernel-good). We are therefore
leveraging here the established theory of linear, or kernel, learning in order to obtain learning guarantees
for similarity measures that are not valid kernels.

Interestingly, in Section 3.3.5 we also show that any kernel that is kernel-good is also a good
similarity function (though with some degradation of parameters). The suggested notion of “goodness”
(Definition 3.3.5) thus encompasses the standard notion of kernel-goodness, and extends it also to
non-positive-definite similarity functions.

Theorem 3.3.3 Let $K$ be an $(\epsilon, \gamma)$-good similarity function for a learning problem $P$. For any $\delta > 0$,
let $S = \{x_1, x_2, \ldots, x_d\}$ be a sample of size $d = 8 \log(1/\delta)/\gamma^2$ drawn from $P$. Consider the mapping
$\phi^S : X \to \mathbb{R}^d$ defined as follows: $\phi^S_i(x) = \frac{K(x, x_i)}{\sqrt{d}}$, $i \in \{1, \ldots, d\}$. With probability at least $1 - \delta$
over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$ has a separator of error at most $\epsilon + \delta$
at margin at least $\gamma/2$.

Proof: Let $w : X \to [0, 1]$ be the weighting function achieving (3.6) of Definition 3.3.5. Consider the
linear separator $\beta \in \mathbb{R}^d$, given by $\beta_i = \frac{y(x_i) w(x_i)}{\sqrt{d}}$. Note that $\|\beta\| \le 1$. We have, for any $x$, $y(x)$:

$y(x) \langle \beta, \phi^S(x) \rangle = \frac{1}{d} \sum_{i=1}^{d} y(x) y(x_i) w(x_i) K(x, x_i)$ (3.7)

The right hand side of (3.7) is an empirical average of $-1 \le y(x) y(x') w(x') K(x, x') \le 1$, and so by
Hoeffding’s inequality, for any $x$, and with probability at least $1 - \delta^2$ over the choice of $S$, we have:

$\frac{1}{d} \sum_{i=1}^{d} y(x) y(x_i) w(x_i) K(x, x_i) \ge E_{x' \sim P}[y(x) y(x') w(x') K(x, x')] - \sqrt{\frac{2 \log(1/\delta^2)}{d}}$ (3.8)

Since the above holds for any $x$ with probability at least $1 - \delta^2$ over the choice of $S$, it also holds with
probability at least $1 - \delta^2$ over the choice of $x$ and $S$. We can write this as:

$E_{S \sim P^d}[\Pr_{x \sim P}(\text{violation})] \le \delta^2$ (3.9)

where “violation” refers to violating (3.8). Applying Markov’s inequality we get that with probability at
least $1 - \delta$ over the choice of $S$, at most $\delta$ fraction of points violate (3.8). Recalling Definition 3.3.5, at
most an additional $\epsilon$ fraction of the points violate (3.6). But for the remaining $1 - \epsilon - \delta$ fraction of the
points, for which both (3.8) and (3.6) hold, we have: $y(x) \langle \beta, \phi^S(x) \rangle \ge \gamma - \sqrt{\frac{2 \log(1/\delta^2)}{d}} = \gamma/2$, where to
get the last inequality we use $d = 8 \log(1/\delta)/\gamma^2$. ■

We can learn a predictor with error rate at most $\epsilon + \epsilon_{acc}$ using an $(\epsilon, \gamma)$-good similarity function $K$ as follows. We first draw from $P$ a sample $S = \{x_1, x_2, \ldots, x_d\}$ of size $d = (4/\gamma)^2 \ln(4/(\delta\epsilon_{acc}))$ and construct the mapping $\phi^S : X \to \mathbb{R}^d$ defined as follows: $\phi^S_i(x) = K(x, x_i)/\sqrt{d}$, $i \in \{1, \ldots, d\}$. The guarantee we have is that with probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$ has a separator of error at most $\epsilon + \epsilon_{acc}/2$ at margin at least $\gamma/2$. So, to learn well, we then draw a new, fresh sample, map it into the transformed space using $\phi^S$, and then learn a linear separator in the new space. The number of landmarks is dominated by the $\tilde{O}((\epsilon + \epsilon_{acc})d/\epsilon_{acc}^2) = \tilde{O}((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2))$ sample complexity of the linear learning, yielding the same order sample complexity as in the kernel case for achieving error at most $\epsilon + \epsilon_{acc}$: $\tilde{O}((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2))$.
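The two-stage procedure just described (draw landmarks, map through $\phi^S$, learn a linear separator on a fresh sample) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy problem (`sample`), the similarity `K(a, b) = a * b`, the sample sizes, and the use of a plain Perceptron as the linear learner are all assumptions chosen for brevity.

```python
import numpy as np

def landmark_map(K, landmarks, X):
    """Map each x to (K(x, x_1), ..., K(x, x_d)) / sqrt(d)."""
    d = len(landmarks)
    return np.array([[K(x, l) for l in landmarks] for x in X]) / np.sqrt(d)

def perceptron(F, y, epochs=100):
    """Plain Perceptron; a stand-in for any linear-separator learner."""
    beta = np.zeros(F.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for f, label in zip(F, y):
            if label * (beta @ f) <= 0:
                beta += label * f
                mistakes += 1
        if mistakes == 0:
            break
    return beta

def sample(rng, m):
    """Hypothetical toy problem: x in [-1, 1] with |x| >= 0.2, label sign(x)."""
    x = rng.uniform(0.2, 1.0, size=m) * rng.choice([-1.0, 1.0], size=m)
    return x, np.sign(x)

K = lambda a, b: a * b                     # toy similarity with range [-1, 1]
rng = np.random.default_rng(0)
landmarks, _ = sample(rng, 20)             # stage 1: landmarks (labels unused)
X_train, y_train = sample(rng, 200)        # stage 2: fresh labeled sample
F_train = landmark_map(K, landmarks, X_train)
beta = perceptron(F_train, y_train)
train_accuracy = np.mean(np.sign(F_train @ beta) == y_train)
```

Note that the landmarks are used only through similarity evaluations, so the first-stage sample does not need labels in this sketch.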

Unfortunately, the above sample complexity refers to learning by finding a linear separator minimizing the error over the training sample. This minimization problem is NP-hard [18], and even NP-hard to approximate [20]. In certain special cases, such as if the induced distribution $\phi^S(P)$ happens to be log-concave, efficient learning algorithms exist [145]. However, as discussed earlier, in the more typical case, one minimizes the hinge-loss instead of the number of errors. We therefore consider also a modification of Definition 3.3.5 that captures the notion of good similarity functions for the SVM and Perceptron algorithms as follows:




Definition 3.3.6 (Main Balcan-Blum'06 Definition, Hinge Loss) A similarity function $K$ is an $(\epsilon, \gamma)$-good similarity function in hinge loss for a learning problem $P$ if there exists a weighting function $w(x') \in [0, 1]$ for all $x' \in X$ such that

$$E_x\big[[1 - y(x)g(x)/\gamma]_+\big] \le \epsilon, \qquad (3.10)$$

where $g(x) = E_{x' \sim P}[y(x')w(x')K(x, x')]$ is the similarity-based prediction made using $w(\cdot)$, and recall that $[1 - z]_+ = \max(0, 1 - z)$ is the hinge-loss.

In other words, we are asking: on average, by how much, in units of $\gamma$, would a random example $x$ fail to satisfy the desired $\gamma$ separation between the weighted similarity to examples of its own label and the weighted similarity to examples of the other label.
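To make the quantity in (3.10) concrete, the following sketch estimates it on a finite sample. The function name `empirical_hinge_goodness` and the four-point toy data are hypothetical; the sketch also reuses the same sample for the inner expectation defining $g(x)$ and for the outer average, which is a simplification rather than part of the definition.

```python
import numpy as np

def empirical_hinge_goodness(K, X, y, w, gamma):
    """Sample estimate of E_x[[1 - y(x) g(x) / gamma]_+]."""
    G = np.array([[K(a, b) for b in X] for a in X])   # G[i, j] = K(x_i, x_j)
    g = G @ (y * w) / len(X)                          # g(x_i) ~ E_{x'}[y(x') w(x') K(x_i, x')]
    return np.mean(np.maximum(0.0, 1.0 - y * g / gamma))

# Toy data (an assumption): four points on the line, labeled by sign, uniform weights.
X = np.array([-1.0, -0.5, 0.5, 1.0])
y = np.sign(X)
w = np.ones(4)
K = lambda a, b: a * b

loss_small_margin = empirical_hinge_goodness(K, X, y, w, gamma=0.375)
loss_large_margin = empirical_hinge_goodness(K, X, y, w, gamma=0.75)
```

Here $y(x)g(x)$ equals $0.75|x|$, so every point clears margin $0.375$ (zero hinge loss), while at margin $0.75$ the two inner points each fall short by half a unit of $\gamma$.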

Similarly  to  Theorem  3.3.3,  we  have: 

Theorem 3.3.4 Let $K$ be an $(\epsilon,\gamma)$-good similarity function in hinge loss for a learning problem $P$. For any $\epsilon_1 > 0$ and $0 < \delta < \gamma\epsilon_1/4$, let $S = \{x_1, x_2, \ldots, x_d\}$ be a sample of size $d = 16\log(1/\delta)/(\epsilon_1\gamma)^2$ drawn from $P$. With probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$, for $\phi^S$ as defined in Theorem 3.3.3, has a separator achieving hinge-loss at most $\epsilon + \epsilon_1$ at margin at least $\gamma$.

Proof: Let $w : X \to [0, 1]$ be the weighting function achieving an expected hinge loss of at most $\epsilon$ at margin $\gamma$, and denote $g(x) = E_{x' \sim P}[y(x')w(x')K(x, x')]$. Defining $\beta$ as in Theorem 3.3.3 and following the same arguments we have that with probability at least $1 - \delta$ over the choice of $S$, at most a $\delta$ fraction of the points $x$ violate (3.8). We will only consider such samples $S$. For those points that do not violate (3.8) we have:

$$[1 - y(x)\langle \beta, \phi^S(x)\rangle/\gamma]_+ \le [1 - y(x)g(x)/\gamma]_+ + \frac{1}{\gamma}\sqrt{\frac{2\log(1/\delta^2)}{d}} \le [1 - y(x)g(x)/\gamma]_+ + \epsilon_1/2 \qquad (3.11)$$

For points that do violate (3.8), we will just bound the hinge loss by the maximum possible hinge-loss:

$$[1 - y(x)\langle \beta, \phi^S(x)\rangle/\gamma]_+ \le 1 + \max_x |y(x)|\,\|\beta\|\,\|\phi^S(x)\|/\gamma \le 1 + 1/\gamma \le 2/\gamma \qquad (3.12)$$

Combining these two cases we can bound the expected hinge-loss at margin $\gamma$:

$$E_{x \sim P}\big[[1 - y(x)\langle \beta, \phi^S(x)\rangle/\gamma]_+\big] \le E_{x \sim P}\big[[1 - y(x)g(x)/\gamma]_+\big] + \epsilon_1/2 + \Pr(\text{violation}) \cdot (2/\gamma)$$
$$\le E_{x \sim P}\big[[1 - y(x)g(x)/\gamma]_+\big] + \epsilon_1/2 + 2\delta/\gamma$$
$$\le E_{x \sim P}\big[[1 - y(x)g(x)/\gamma]_+\big] + \epsilon_1, \qquad (3.13)$$

where the last inequality follows from $\delta < \epsilon_1\gamma/4$. ■

We can learn a predictor with error rate at most $\epsilon + \epsilon_{acc}$ using an $(\epsilon, \gamma)$-good similarity function $K$ as follows. We first draw from $P$ a sample $S = \{x_1, x_2, \ldots, x_d\}$ of size $d = 16\log(2/\delta)/(\epsilon_{acc}\gamma)^2$ and construct the mapping $\phi^S : X \to \mathbb{R}^d$ defined as follows: $\phi^S_i(x) = K(x, x_i)/\sqrt{d}$, $i \in \{1, \ldots, d\}$. The guarantee we have is that with probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$ has a separator achieving hinge-loss at most $\epsilon + \epsilon_{acc}/2$ at margin $\gamma$. So, to learn well, we can then use an SVM solver in the $\phi^S$-space to obtain (with probability at least $1 - 2\delta$) a predictor with error rate $\epsilon + \epsilon_{acc}$ using $\tilde{O}(1/(\gamma^2\epsilon_{acc}^2))$ examples, and time polynomial in $1/\gamma$, $1/\epsilon_{acc}$ and $\log(1/\delta)$.




3.3.4  Extensions 


We  present  here  a  few  extensions  of  our  basic  setting  in  Section  3.3.3.  For  simplicity,  we  only  consider 
the  margin-violation  version  of  our  definitions,  but  all  the  results  here  can  be  easily  extended  to  the  hinge 
loss  case  as  well. 


Combining  Multiple  Similarity  Functions 


Suppose that rather than having a single similarity function, we were instead given $n$ functions $K_1, \ldots, K_n$, and our hope is that some convex combination of them will satisfy Definition 3.3.5. Is this sufficient to be able to learn well? (Note that a convex combination of similarity functions is guaranteed to have range $[-1, 1]$ and so be a legal similarity function.) The following generalization of Theorem 3.3.3 shows that this is indeed the case, though the margin parameter drops by a factor of $\sqrt{n}$. This result can be viewed as analogous to the idea of learning a kernel matrix studied by [157] except that rather than explicitly learning the best convex combination, we are simply folding the learning process into the second stage of the algorithm.


Theorem 3.3.5 Suppose $K_1, \ldots, K_n$ are similarity functions such that some (unknown) convex combination of them is $(\epsilon, \gamma)$-good. If one draws a set $S = \{x_1, x_2, \ldots, x_d\}$ from $P$ containing $d = 8\log(1/\delta)/\gamma^2$ examples, then with probability at least $1 - \delta$, the mapping $\phi^S : X \to \mathbb{R}^{nd}$ defined as $\phi^S(x) = \rho^S(x)/\sqrt{nd}$,

$$\rho^S(x) = (K_1(x, x_1), \ldots, K_1(x, x_d), \ldots, K_n(x, x_1), \ldots, K_n(x, x_d))$$

has the property that the induced distribution $\phi^S(P)$ in $\mathbb{R}^{nd}$ has a separator of error at most $\epsilon + \delta$ at margin at least $\gamma/(2\sqrt{n})$.

Proof: Let $K = \alpha_1 K_1 + \ldots + \alpha_n K_n$ be an $(\epsilon, \gamma)$-good convex combination of the $K_i$. By Theorem 3.3.3, had we instead performed the mapping $\hat{\phi}^S : X \to \mathbb{R}^d$ defined as $\hat{\phi}^S(x) = \hat{\rho}^S(x)/\sqrt{d}$,

$$\hat{\rho}^S(x) = (K(x, x_1), \ldots, K(x, x_d))$$

then with probability $1 - \delta$, the induced distribution $\hat{\phi}^S(P)$ in $\mathbb{R}^d$ would have a separator of error at most $\epsilon + \delta$ at margin at least $\gamma/2$. Let $\hat{\beta}$ be the vector corresponding to such a separator in that space. Now, let us convert $\hat{\beta}$ into a vector in $\mathbb{R}^{nd}$ by replacing each coordinate $\hat{\beta}_j$ with the $n$ values $(\alpha_1\hat{\beta}_j, \ldots, \alpha_n\hat{\beta}_j)$. Call the resulting vector $\tilde{\beta}$. Notice that by design, for any $x$ we have $\langle \tilde{\beta}, \phi^S(x)\rangle = \frac{1}{\sqrt{n}}\langle \hat{\beta}, \hat{\phi}^S(x)\rangle$.

Furthermore, $\|\tilde{\beta}\| \le \|\hat{\beta}\| \le 1$ (the worst case is when exactly one of the $\alpha_i$ is equal to 1 and the rest are 0). Thus, the vector $\tilde{\beta}$ under distribution $\phi^S(P)$ has similar properties to the vector $\hat{\beta}$ under $\hat{\phi}^S(P)$; so, using the proof of Theorem 3.3.3 we obtain that the induced distribution $\phi^S(P)$ in $\mathbb{R}^{nd}$ has a separator of error at most $\epsilon + \delta$ at margin at least $\gamma/(2\sqrt{n})$. ■
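The key step of this proof, lifting a separator for the (unknown) combined similarity into the concatenated feature space, can be checked numerically. The sketch below is illustrative only: the two similarity functions, the mixture `alpha`, the landmarks, and the vector `beta_hat` are arbitrary assumptions, and the lifted vector is laid out in the same block order as $\rho^S$.

```python
import numpy as np

def rho(kernels, landmarks, x):
    """Concatenate (K_i(x, x_1), ..., K_i(x, x_d)) over all similarity functions K_i."""
    return np.array([K(x, l) for K in kernels for l in landmarks])

rng = np.random.default_rng(1)
kernels = [lambda a, b: np.tanh(a * b), lambda a, b: np.cos(a - b)]  # both have range in [-1, 1]
alpha = np.array([0.3, 0.7])                                         # hypothetical convex mixture
K_mix = lambda a, b: sum(al * K(a, b) for al, K in zip(alpha, kernels))

landmarks = rng.uniform(-1, 1, size=5)
n, d = len(kernels), len(landmarks)

beta_hat = rng.normal(size=d)
beta_hat /= np.linalg.norm(beta_hat)             # any unit vector in R^d
beta_tilde = np.outer(alpha, beta_hat).ravel()   # entry (i, j) is alpha_i * beta_hat_j

x = 0.4
phi = rho(kernels, landmarks, x) / np.sqrt(n * d)      # concatenated mapping
phi_hat = rho([K_mix], landmarks, x) / np.sqrt(d)      # mapping for the combined K
lhs = beta_tilde @ phi
rhs = beta_hat @ phi_hat / np.sqrt(n)
```

The learner never needs to know `alpha`: it only sees the concatenated features, and the existence of `beta_tilde` is what guarantees a good separator in that space.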

Note that the above argument actually shows something a bit stronger than Theorem 3.3.5. In particular, if we define $\alpha = (\alpha_1, \ldots, \alpha_n)$ to be the mixture vector for the optimal $K$, then we can replace the margin bound $\gamma/(2\sqrt{n})$ with $\gamma/(2\|\alpha\|\sqrt{n})$. For example, if $\alpha$ is the uniform mixture, then we just get the bound in Theorem 3.3.3 of $\gamma/2$.

Also note that if we are in fact using an $L_1$-based learning algorithm then we could do much better; for details on such an approach see Section 3.4.6.




Multi-class  Classification 


We can naturally extend all our results to multi-class classification. Assume for concreteness that there are $r$ possible labels, and denote the space of possible labels by $Y = \{1, \ldots, r\}$; thus, by a multi-class learning problem we mean a distribution $P$ over labeled examples $(x, y(x))$, where $x \in X$ and $y(x) \in Y$.

For this multi-class setting, Definition 3.3.4 seems most natural to extend. Specifically:

Definition 3.3.7 (main, multi-class) A similarity function $K$ is an $(\epsilon, \gamma)$-good similarity function for a multi-class learning problem $P$ if there exists a bounded weighting function $w$ over $X$ ($w(x') \in [0, 1]$ for all $x' \in X$) such that at least a $1 - \epsilon$ probability mass of examples $x$ satisfy:

$$E_{x' \sim P}[w(x')K(x, x') \mid y(x') = y(x)] \ge E_{x' \sim P}[w(x')K(x, x') \mid y(x') = i] + \gamma \quad \text{for all } i \in Y,\ i \ne y(x)$$

We can then extend the argument in Theorem 3.3.3 and learn using standard adaptations of linear-separator algorithms to the multi-class case (e.g., see [110]).

3.3.5  Relationship  Between  Good  Kernels  and  Good  Similarity  Measures 

As discussed earlier, the similarity-based theory of learning is more general than the traditional kernel-based theory, since a good similarity function need not be a valid kernel. However, for a similarity function $K$ that is a valid kernel, it is interesting to understand the relationship between the learning results guaranteed by the two theories. Similar learning guarantees and sample complexity bounds can be obtained if $K$ is either an $(\epsilon, \gamma)$-good similarity function, or a valid kernel and $(\epsilon, \gamma)$-kernel-good. In fact, as we saw in Section 3.3.3, the similarity-based guarantees are obtained by transforming (using a sample) the problem of learning with an $(\epsilon, \gamma)$-good similarity function to learning with a kernel with essentially the same goodness parameters. This is made more explicit in Corollary 3.3.11.

In this section we study the relationship between a kernel function being good in the similarity sense of Definitions 3.3.5 and 3.3.6 and good in the kernel sense. We show that a valid kernel function that is good for one notion is in fact good also for the other notion. The qualitative notions of being "good" are therefore equivalent for valid kernels, and so in this sense the more general similarity-based notion subsumes the familiar kernel-based notion.

However, as we will see, the similarity-based margin of a valid kernel might be lower than the kernel-based margin, yielding a possible increase in the sample complexity guarantees if a kernel is used as a similarity measure. We also show that for a valid kernel, the kernel-based margin is never smaller than the similarity-based margin. We provide a tight bound on this possible deterioration of the margin when switching to the similarity-based notion given by Definitions 3.3.5 and 3.3.6. (Note also that in the following Section 3.4 we provide an even better notion of a good similarity function that provides a better kernel-to-similarity translation.)

Specifically,  we  show  that  if  a  valid  kernel  function  is  good  in  the  similarity  sense,  it  is  also  good  in 
the  standard  kernel  sense,  both  for  the  margin  violation  error  rate  and  for  the  hinge  loss: 

Theorem 3.3.6 (A kernel good as a similarity function is also good as a kernel) If $K$ is a valid kernel function, and is $(\epsilon, \gamma)$-good similarity for some learning problem, then it is also $(\epsilon, \gamma)$-kernel-good for the learning problem. If $K$ is $(\epsilon, \gamma)$-good similarity in hinge loss, then it is also $(\epsilon, \gamma)$-kernel-good in hinge loss.

We also show the converse: if a kernel function is good in the kernel sense, it is also good in the similarity sense, though with some degradation of the margin:




Theorem 3.3.7 (A good kernel is also a good similarity function: margin violations) If $K$ is $(\epsilon_0, \gamma)$-kernel-good for some learning problem (with deterministic labels), then it is also $(\epsilon_0 + \epsilon_1, \frac{1}{2}(1 - \epsilon_0)\epsilon_1\gamma^2)$-good similarity for the learning problem, for any $\epsilon_1 > 0$.

Note that in any useful situation $\epsilon_0 < \frac{1}{2}$, and so the guaranteed margin is at least $\frac{1}{4}\epsilon_1\gamma^2$. A similar guarantee holds also for the hinge loss:

Theorem 3.3.8 (A good kernel is also a good similarity function: hinge loss) If $K$ is $(\epsilon_0, \gamma)$-kernel-good in hinge loss for a learning problem (with deterministic labels), then it is also $(\epsilon_0 + \epsilon_1, 2\epsilon_1\gamma^2)$-good similarity in hinge loss for the learning problem, for any $\epsilon_1 > 0$.

These results establish that treating a kernel as a similarity function would still enable learning, although with a somewhat increased sample complexity. As we show, the deterioration of the margin in the above results, which yields an increase in the sample complexity guarantees, is unavoidable:

Theorem 3.3.9 (Tightness, Margin Violations) For any $0 < \gamma < \sqrt{1/2}$ and any $0 < \epsilon_1 < \frac{1}{2}$, there exists a learning problem and a kernel function $K$, which is $(0, \gamma)$-kernel-good for the learning problem, but which is only $(\epsilon_1, 4\epsilon_1\gamma^2)$-good similarity. That is, it is not $(\epsilon_1, \gamma')$-good similarity for any $\gamma' > 4\epsilon_1\gamma^2$.

Theorem 3.3.10 (Tightness, Hinge Loss) For any $0 < \gamma < \sqrt{1/2}$ and any $0 < \epsilon_1 < \frac{1}{2}$, there exists a learning problem and a kernel function $K$, which is $(0, \gamma)$-kernel-good in hinge loss for the learning problem, but which is only $(\epsilon_1, 32\epsilon_1\gamma^2)$-good similarity in hinge loss.

To prove Theorem 3.3.6 we will show, for any weight function, an explicit low-norm linear predictor $\beta$ (in the implied Hilbert space), with equivalent behavior. To prove Theorems 3.3.7 and 3.3.8, we will consider a kernel function that is $(\epsilon_0, \gamma)$-kernel-good and show that it is also good as a similarity function. We will first treat goodness in hinge-loss and prove Theorem 3.3.8, which can be viewed as a more general result. This will be done using the representation of the optimal SVM solution in terms of the dual optimal solution. We then prove Theorem 3.3.7 in terms of the margin violation error rate, by using the hinge-loss as a bound on the error rate. To prove Theorems 3.3.9 and 3.3.10, we present an explicit learning problem and kernel.

Transforming  a  Good  Similarity  Function  to  a  Good  Kernel 

Before  proving  the  above  Theorems,  we  briefly  return  to  the  mapping  of  Theorem  3.3.3  and  explicitly 
present  it  as  a  mapping  between  a  good  similarity  function  and  a  good  kernel: 

Corollary 3.3.11 (A good similarity function can be transformed to a good kernel) If $K$ is an $(\epsilon, \gamma)$-good similarity function for some learning problem $P$, then for any $0 < \delta < 1$, given a sample $S$ of size $(8/\gamma^2)\log(1/\delta)$ drawn from $P$, we can construct, with probability at least $1 - \delta$ over the draw of $S$, a valid kernel $K^S$ that is $(\epsilon + \delta, \gamma/2)$-kernel-good for $P$.

If $K$ is an $(\epsilon, \gamma)$-good similarity function in hinge-loss for some learning problem $P$, then for any $\epsilon_1 > 0$ and $0 < \delta < \gamma\epsilon_1/4$, given a sample $S$ of size $16\log(1/\delta)/(\epsilon_1\gamma)^2$ drawn from $P$, we can construct, with probability at least $1 - \delta$ over the draw of $S$, a valid kernel $K^S$ that is $(\epsilon + \epsilon_1, \gamma)$-kernel-good for $P$.

Proof: Let $K^S(x, x') = \langle \phi^S(x), \phi^S(x')\rangle$ where $\phi^S$ is the transformation of Theorems 3.3.3 and 3.3.4. ■


From  this  statement,  it  is  clear  that  kernel-based  learning  guarantees  apply  also  to  learning  with  a  good 
similarity  function,  essentially  with  the  same  parameters. 
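As a small numerical illustration of the construction $K^S(x, x') = \langle \phi^S(x), \phi^S(x')\rangle$, the sketch below starts from a similarity function that is not positive semi-definite and confirms that the induced Gram matrix is. The particular similarity (a sigmoid-style function) and the four landmark points are assumptions chosen only because its lack of positive semi-definiteness is easy to see: some diagonal entries $K(x, x)$ are negative.

```python
import numpy as np

# A similarity with range in [-1, 1] that is NOT a valid kernel:
# K(x, x) = tanh(2 x^2 - 1) is negative for small |x|.
K = lambda a, b: np.tanh(2.0 * a * b - 1.0)

points = np.array([-0.9, -0.2, 0.2, 0.9])   # used both as landmarks S and as evaluation points
d = len(points)

gram_K = np.array([[K(a, b) for b in points] for a in points])
features = gram_K / np.sqrt(d)              # row i is phi^S(points[i])
gram_KS = features @ features.T             # K^S(x, x') = <phi^S(x), phi^S(x')>

min_eig_K = np.linalg.eigvalsh(gram_K).min()
min_eig_KS = np.linalg.eigvalsh(gram_KS).min()
```

The induced kernel depends on the sample through the landmarks, which is the distribution dependence discussed in the text.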




It is important to understand that the result of Corollary 3.3.11 is of a very different nature than the results of Theorems 3.3.6 - 3.3.10. The claim here is not that a good similarity function is a good kernel; it can't be if it is not positive semi-definite. But, given a good similarity function we can create a good kernel. This transformation is distribution-dependent, and can be calculated using a sample $S$.

Proof  of  Theorem  3.3.6 

Consider a similarity function $K$ that is a valid kernel, i.e. $K(x, x') = \langle \phi(x), \phi(x')\rangle$ for some mapping $\phi$ of $X$ to a Hilbert space $\mathcal{H}$. For any input distribution and any valid weighting $w(x)$ of the inputs (i.e. $0 \le w(x) \le 1$), we will construct a linear predictor $\beta_w \in \mathcal{H}$, with $\|\beta_w\| \le 1$, such that similarity-based predictions using $w$ are the same as the linear predictions made with $\beta_w$.

Define the following linear predictor $\beta_w \in \mathcal{H}$:

$$\beta_w = E_{x'}[y(x')w(x')\phi(x')].$$

The predictor $\beta_w$ has norm at most:

$$\|\beta_w\| = \|E_{x'}[y(x')w(x')\phi(x')]\| \le \max_{x'} \|y(x')w(x')\phi(x')\| \le \max_{x'} \|\phi(x')\| = \max_{x'} \sqrt{K(x', x')} \le 1$$

where the second inequality follows from $|w(x')|, |y(x')| \le 1$.

The predictions made by $\beta_w$ are:

$$\langle \beta_w, \phi(x)\rangle = \langle E_{x'}[y(x')w(x')\phi(x')], \phi(x)\rangle = E_{x'}[y(x')w(x')\langle \phi(x'), \phi(x)\rangle] = E_{x'}[y(x')w(x')K(x, x')]$$

That is, using $\beta_w$ is the same as using similarity-based prediction with $w$. In particular, the margin violation rate, as well as the hinge loss, with respect to any margin $\gamma$, is the same for predictions made using either $w$ or $\beta_w$. This is enough to establish Theorem 3.3.6: if $K$ is $(\epsilon, \gamma)$-good (perhaps with respect to the hinge-loss), there exists some valid weighting $w$ that yields margin violation error rate (resp. hinge loss) at most $\epsilon$ with respect to margin $\gamma$, and so $\beta_w$ yields the same margin violation (resp. hinge loss) with respect to the same margin, establishing that $K$ is $(\epsilon, \gamma)$-kernel-good (resp. for the hinge loss).
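The equivalence between the weighted similarity prediction and the linear predictor $\beta_w$ is easy to verify on a finite distribution. The sketch below assumes an explicit feature map (points on the unit sphere in $\mathbb{R}^3$, with $K$ the inner product), a uniform distribution over six points, and randomly chosen weights; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(6, 3))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # ||phi(x)|| = 1, so K(x, x) = 1
p = np.full(6, 1.0 / 6.0)                           # uniform distribution over the six points
y = np.sign(Phi[:, 0])                              # deterministic labels
w = rng.uniform(0.0, 1.0, size=6)                   # any valid weighting in [0, 1]

beta_w = (p * y * w) @ Phi                          # beta_w = E_{x'}[y(x') w(x') phi(x')]
K = Phi @ Phi.T                                     # K(x, x') = <phi(x), phi(x')>

similarity_pred = K @ (p * y * w)                   # E_{x'}[y(x') w(x') K(x, x')] for each x
linear_pred = Phi @ beta_w                          # <beta_w, phi(x)> for each x
beta_norm = np.linalg.norm(beta_w)
```

Because the two prediction vectors coincide, any margin-based quantity (violation rate or hinge loss) computed from one equals the same quantity computed from the other.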

Proof of Theorem 3.3.8: Guarantee on the Hinge Loss

Recall that we are considering only learning problems where the label $y$ is a deterministic function of $x$. For simplicity of presentation, we first consider finite discrete distributions, where:

$$\Pr(x_i, y_i) = p_i \qquad (3.14)$$

for $i = 1 \ldots n$, with $\sum_{i=1}^{n} p_i = 1$ and $x_i \ne x_j$ for $i \ne j$.

Let $K$ be any kernel function that is $(\epsilon_0, \gamma)$-kernel-good in hinge loss. Let $\phi$ be the implied feature mapping and denote $\phi_i = \phi(x_i)$. Consider the following weighted-SVM quadratic optimization problem with regularization parameter $C$:

$$\text{minimize } \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n} p_i[1 - y_i\langle \beta, \phi_i\rangle]_+ \qquad (3.15)$$



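The dual of this problem, with dual variables $\alpha_i$, is:

$$\text{maximize } \sum_i \alpha_i - \frac{1}{2}\sum_{ij} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le Cp_i \qquad (3.16)$$

There is no duality gap, and furthermore the primal optimum $\beta^*$ can be expressed in terms of the dual optimum $\alpha^*$: $\beta^* = \sum_i \alpha_i^* y_i \phi_i$.

Since $K$ is $(\epsilon_0, \gamma)$-kernel-good in hinge-loss, there exists a predictor $\|\beta_0\| = 1$ with average hinge loss $\epsilon_0$ relative to margin $\gamma$. The primal optimum $\beta^*$ of (3.15), being the optimum solution, then satisfies:

$$\frac{1}{2}\|\beta^*\|^2 + C\sum_i p_i[1 - y_i\langle \beta^*, \phi_i\rangle]_+ \le \frac{1}{2}\left\|\tfrac{1}{\gamma}\beta_0\right\|^2 + C\sum_i p_i\left[1 - y_i\langle \tfrac{1}{\gamma}\beta_0, \phi_i\rangle\right]_+$$
$$= \frac{1}{2\gamma^2} + C\,E\left[\left[1 - y\langle \tfrac{1}{\gamma}\beta_0, \phi(x)\rangle\right]_+\right] = \frac{1}{2\gamma^2} + C\epsilon_0 \qquad (3.17)$$

Since both terms on the left-hand side are non-negative, each of them is bounded by the right-hand side, and in particular:

$$C\sum_i p_i[1 - y_i\langle \beta^*, \phi_i\rangle]_+ \le \frac{1}{2\gamma^2} + C\epsilon_0 \qquad (3.18)$$

Dividing by $C$ we get a bound on the average hinge-loss of the predictor $\beta^*$, relative to a margin of one:

$$E\big[[1 - y\langle \beta^*, \phi(x)\rangle]_+\big] \le \frac{1}{2C\gamma^2} + \epsilon_0 \qquad (3.19)$$

We now use the fact that $\beta^*$ can be written as $\beta^* = \sum_i \alpha_i^* y_i \phi_i$ with $0 \le \alpha_i^* \le Cp_i$. Using the weights

$$w_i = w(x_i) = \alpha_i^*/(Cp_i) \le 1 \qquad (3.20)$$

we have for every $x, y$:

$$y\,E_{x',y'}[w(x')y'K(x, x')] = y\sum_i p_i w(x_i) y_i K(x, x_i) \qquad (3.21)$$
$$= y\sum_i p_i \alpha_i^* y_i K(x, x_i)/(Cp_i)$$
$$= y\sum_i \alpha_i^* y_i \langle \phi_i, \phi(x)\rangle/C = y\langle \beta^*, \phi(x)\rangle/C$$

Multiplying by $C$ and using (3.19):

$$E_{x,y}\big[[1 - Cy\,E_{x',y'}[w(x')y'K(x, x')]]_+\big] = E_{x,y}\big[[1 - y\langle \beta^*, \phi(x)\rangle]_+\big] \le \frac{1}{2C\gamma^2} + \epsilon_0 \qquad (3.22)$$

This holds for any $C$, and describes the average hinge-loss relative to margin $1/C$. To get an average hinge-loss of $\epsilon_0 + \epsilon_1$, we set $C = 1/(2\epsilon_1\gamma^2)$ and get:

$$E_{x,y}\big[[1 - y\,E_{x',y'}[w(x')y'K(x, x')]/(2\epsilon_1\gamma^2)]_+\big] \le \epsilon_0 + \epsilon_1 \qquad (3.23)$$

This establishes that $K$ is $(\epsilon_0 + \epsilon_1, 2\epsilon_1\gamma^2)$-good similarity in hinge-loss.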
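The construction of the weights from the dual solution can be sketched numerically. The code below is an illustration under several assumptions: it approximately solves the dual (3.16) by projected gradient ascent (a simple stand-in for a proper QP or SVM solver), uses a four-point distribution in $\mathbb{R}^3$ with the inner-product kernel as hypothetical data, and picks $\epsilon_1 = 0.1$. It then checks that the weights of (3.20) lie in $[0, 1]$ and reproduce the identity (3.21), which holds for any feasible dual vector.

```python
import numpy as np

def weighted_svm_dual(Kmat, y, p, C, iters=5000):
    """Projected gradient ascent on: max sum(a) - 0.5 a^T Q a, s.t. 0 <= a_i <= C p_i."""
    Q = (y[:, None] * y[None, :]) * Kmat
    step = 1.0 / max(np.linalg.eigvalsh(Q).max(), 1e-12)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + step * (1.0 - Q @ a), 0.0, C * p)   # gradient step, then box projection
    return a

# Hypothetical data: four unit-norm points separated by beta = (1, 0, 0) with margin gamma.
gamma, eps, eps1 = 0.3, 0.05, 0.1
s = np.sqrt(1 - 2 * gamma**2)
X = np.array([[gamma, gamma, s], [gamma, -gamma, s], [-gamma, gamma, s], [-gamma, -gamma, s]])
y = np.array([1.0, 1.0, -1.0, -1.0])
p = np.array([0.5 - eps, eps, eps, 0.5 - eps])

Kmat = X @ X.T
C = 1.0 / (2 * eps1 * gamma**2)           # the choice of C made in the proof
a = weighted_svm_dual(Kmat, y, p, C)
w = a / (C * p)                           # weights of (3.20)

beta = (a * y) @ X                        # beta = sum_i a_i y_i phi_i
similarity_pred = Kmat @ (p * w * y)      # E_{x'}[w(x') y' K(x, x')] at each x_i
linear_pred = X @ beta / C                # <beta, phi(x_i)> / C
```

The hinge-loss bound (3.22) additionally requires that `a` be (close to) the dual optimum, which this sketch does not verify.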




Non-discrete  distributions 


The same arguments apply also in the general (not necessarily discrete) case, except that this time, instead of a fairly standard (weighted) SVM problem, we must deal with a variational optimization problem, where the optimization variable is a random variable (a function from the sample space to the reals). We will present the dualization in detail.

We consider the primal objective

$$\text{minimize } \frac{1}{2}\|\beta\|^2 + C\,E\big[[1 - y\langle \beta, \phi\rangle]_+\big] \qquad (3.24)$$

where the expectation is w.r.t. the distribution $P$, with $\phi = \phi(x)$ here and throughout the rest of this section. We will rewrite this objective using explicit slack, in the form of a random variable $\xi$, which will be a variational optimization variable:

$$\text{minimize } \frac{1}{2}\|\beta\|^2 + C\,E[\xi]$$
$$\text{subject to } \Pr(1 - y\langle \beta, \phi\rangle - \xi \le 0) = 1 \qquad (3.25)$$
$$\Pr(\xi \ge 0) = 1$$

In the rest of this section all our constraints will implicitly be required to hold with probability one. We will now introduce the dual variational optimization variable $\alpha$, also a random variable over the same sample space, and write the problem as a saddle problem:

$$\min_{\beta,\xi}\ \max_{\alpha}\ \frac{1}{2}\|\beta\|^2 + C\,E[\xi] + E[\alpha(1 - y\langle \beta, \phi\rangle - \xi)] \qquad (3.26)$$
$$\text{subject to } \xi \ge 0,\ \alpha \ge 0$$

Note that this choice of Lagrangian is a bit different than the more standard Lagrangian leading to (3.16). Convexity and the existence of a feasible point in the dual interior allow us to change the order of maximization and minimization without changing the value of the problem, even in the infinite case [134]. Rearranging terms we obtain the equivalent problem:

$$\max_{\alpha}\ \min_{\beta,\xi}\ \frac{1}{2}\|\beta\|^2 - \langle E[\alpha y\phi], \beta\rangle + E[\xi(C - \alpha)] + E[\alpha] \qquad (3.27)$$
$$\text{subject to } \xi \ge 0,\ \alpha \ge 0$$

Similarly to the finite case, we see that the minimum of the minimization problem is obtained when $\beta = E[\alpha y\phi]$ and that it is finite when $\alpha \le C$ almost surely, yielding the dual:

$$\text{maximize } E[\alpha] - \frac{1}{2}E[\alpha y \alpha' y' K(x, x')] \qquad (3.28)$$
$$\text{subject to } 0 \le \alpha \le C$$

where $(x, y, \alpha)$ and $(x', y', \alpha')$ are two independent draws from the same distribution. The primal optimum can be expressed as $\beta^* = E[\alpha^* y\phi]$, where $\alpha^*$ is the dual optimum. We can now apply the same arguments as in (3.17), (3.18) to get (3.19). Using the weight mapping

$$w(x) = E[\alpha^* \mid x]/C \le 1 \qquad (3.29)$$

we have for every $x, y$:

$$y\,E_{x',y'}[w(x')y'K(x, x')] = y\langle E_{x',y',\alpha'}[\alpha' y' \phi(x')], \phi(x)\rangle/C = y\langle \beta^*, \phi(x)\rangle/C. \qquad (3.30)$$

From here we can already get (3.22), and setting $C = 1/(2\epsilon_1\gamma^2)$ we get (3.23), which establishes Theorem 3.3.8 for any learning problem (with deterministic labels).




Proof of Theorem 3.3.7: Guarantee on Margin Violations

We will now turn to guarantees on similarity-goodness with respect to the margin violation error-rate. We base these on the results for goodness in hinge loss, using the hinge loss as a bound on the margin violation error-rate. In particular, a violation of margin $\gamma/2$ implies a hinge-loss at margin $\gamma$ of at least $\frac{1}{2}$. Therefore, twice the average hinge-loss at margin $\gamma$ is an upper bound on the margin violation error rate at margin $\gamma/2$.

The kernel-separable case, i.e. $\epsilon_0 = 0$, is simpler, and we consider it first. Having no margin violations implies zero hinge loss. And so if a kernel $K$ is $(0, \gamma)$-kernel-good, it is also $(0, \gamma)$-kernel-good in hinge loss, and by Theorem 3.3.8 it is $(\epsilon_1/2, 2(\epsilon_1/2)\gamma^2)$-good similarity in hinge loss. Now, for any $\epsilon_1 > 0$, by bounding the error-rate at margin $\frac{1}{2}\epsilon_1\gamma^2$ by twice the average hinge loss at margin $\epsilon_1\gamma^2$, $K$ is $(\epsilon_1, \frac{1}{2}\epsilon_1\gamma^2)$-good similarity, establishing Theorem 3.3.7 for the case $\epsilon_0 = 0$.

We now return to the non-separable case, and consider a kernel $K$ that is $(\epsilon_0, \gamma)$-kernel-good, with some non-zero error-rate $\epsilon_0$. Since we cannot bound the hinge loss in terms of the margin violations, we will instead consider a modified distribution where the margin violations are removed.

Let $\beta^*$ be the linear classifier achieving $\epsilon_0$ margin violation error-rate with respect to margin $\gamma$, i.e. such that $\Pr(y\langle \beta^*, \phi(x)\rangle \ge \gamma) \ge 1 - \epsilon_0$. We will consider a distribution which is conditioned on $y\langle \beta^*, \phi(x)\rangle \ge \gamma$. We denote this event as $\text{OK}(x)$ (recall that $y$ is a deterministic function of $x$). The kernel $K$ is obviously $(0, \gamma)$-kernel-good, and so by the arguments above also $(\epsilon_1, \frac{1}{2}\epsilon_1\gamma^2)$-good similarity, on the conditional distribution. Let $w$ be the weight mapping achieving

$$\Pr_{x,y}\big(y\,E_{x',y'}[w(x')y'K(x, x') \mid \text{OK}(x')] < \gamma_1 \mid \text{OK}(x)\big) \le \epsilon_1, \qquad (3.31)$$

where $\gamma_1 = \frac{1}{2}\epsilon_1\gamma^2$, and set $w(x) = 0$ when $\text{OK}(x)$ does not hold. We have:

$$\Pr_{x,y}\big(y\,E_{x',y'}[w(x')y'K(x, x')] < (1 - \epsilon_0)\gamma_1\big)$$
$$\le \Pr(\text{not OK}(x)) + \Pr(\text{OK}(x))\,\Pr_{x,y}\big(y\,E_{x',y'}[w(x')y'K(x, x')] < (1 - \epsilon_0)\gamma_1 \mid \text{OK}(x)\big)$$
$$= \epsilon_0 + (1 - \epsilon_0)\Pr_{x,y}\big(y(1 - \epsilon_0)E_{x',y'}[w(x')y'K(x, x') \mid \text{OK}(x')] < (1 - \epsilon_0)\gamma_1 \mid \text{OK}(x)\big)$$
$$= \epsilon_0 + (1 - \epsilon_0)\Pr_{x,y}\big(y\,E_{x',y'}[w(x')y'K(x, x') \mid \text{OK}(x')] < \gamma_1 \mid \text{OK}(x)\big)$$
$$\le \epsilon_0 + (1 - \epsilon_0)\epsilon_1 \le \epsilon_0 + \epsilon_1 \qquad (3.32)$$

establishing that $K$ is $(\epsilon_0 + \epsilon_1, (1 - \epsilon_0)\gamma_1)$-good similarity for the original (unconditioned) distribution, thus yielding Theorem 3.3.7.

Tightness 

We now turn to proving Theorems 3.3.9 and 3.3.10. This is done by presenting a specific distribution $P$ and kernel in which the guarantees hold tightly.

Consider the standard Euclidean inner-product and a distribution on four labeled points in $\mathbb{R}^3$, given by:

$$x_1 = (\gamma, \gamma, \sqrt{1 - 2\gamma^2}), \quad y_1 = 1, \quad p_1 = \tfrac{1}{2} - \epsilon$$
$$x_2 = (\gamma, -\gamma, \sqrt{1 - 2\gamma^2}), \quad y_2 = 1, \quad p_2 = \epsilon$$
$$x_3 = (-\gamma, \gamma, \sqrt{1 - 2\gamma^2}), \quad y_3 = -1, \quad p_3 = \epsilon$$
$$x_4 = (-\gamma, -\gamma, \sqrt{1 - 2\gamma^2}), \quad y_4 = -1, \quad p_4 = \tfrac{1}{2} - \epsilon$$

for some (small) $0 < \gamma < \sqrt{1/2}$ and (small) probability $0 < \epsilon < \frac{1}{2}$. The four points are all on the unit sphere (i.e. $\|x_i\| = 1$ and so $K(x_i, x_j) = \langle x_i, x_j\rangle \le 1$), and are clearly separated by $\beta = (1, 0, 0)$ with a margin of $\gamma$. The standard inner-product kernel is therefore $(0, \gamma)$-kernel-good on this distribution.

Proof of Theorem 3.3.9: Tightness for Margin Violations

We will show that when this kernel (the standard inner product kernel in $\mathbb{R}^3$) is used as a similarity function, the best margin that can be obtained on all four points, i.e. on at least $1 - \epsilon$ probability mass of examples, is at most $4\epsilon\gamma^2$.

Consider the classification margin on point $x_2$ with weights $w$ (denote $w_i = w(x_i)$):

$$E[w(x)yK(x_2, x)] = (\tfrac{1}{2} - \epsilon)w_1(\gamma^2 - \gamma^2 + (1 - 2\gamma^2)) + \epsilon w_2(2\gamma^2 + (1 - 2\gamma^2))$$
$$\quad - \epsilon w_3(-2\gamma^2 + (1 - 2\gamma^2)) - (\tfrac{1}{2} - \epsilon)w_4(-\gamma^2 + \gamma^2 + (1 - 2\gamma^2))$$
$$= \Big((\tfrac{1}{2} - \epsilon)(w_1 - w_4) + \epsilon(w_2 - w_3)\Big)(1 - 2\gamma^2) + 2\epsilon(w_2 + w_3)\gamma^2 \qquad (3.33)$$

If the first term is positive, we can consider the symmetric calculation

$$-E[w(x)yK(x_3, x)] = -\Big((\tfrac{1}{2} - \epsilon)(w_1 - w_4) + \epsilon(w_2 - w_3)\Big)(1 - 2\gamma^2) + 2\epsilon(w_2 + w_3)\gamma^2$$

in which the first term is negated. One of the above margins must therefore be at most

$$2\epsilon(w_2 + w_3)\gamma^2 \le 4\epsilon\gamma^2 \qquad (3.34)$$

This establishes Theorem 3.3.9.
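The bound (3.34) can be sanity-checked numerically on the four-point example. The sketch below fixes hypothetical values $\gamma = 0.3$ and $\epsilon = 0.05$, samples random weightings in $[0, 1]^4$, and records the larger of the best achieved minimum margins over $x_2$ and $x_3$; random search is of course only a check, not a proof.

```python
import numpy as np

def four_point_problem(gamma, eps):
    """The four labeled points of the tightness example, with their probabilities."""
    s = np.sqrt(1 - 2 * gamma**2)
    X = np.array([[gamma, gamma, s], [gamma, -gamma, s],
                  [-gamma, gamma, s], [-gamma, -gamma, s]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    p = np.array([0.5 - eps, eps, eps, 0.5 - eps])
    return X, y, p

def similarity_margins(X, y, p, w):
    """y_i * E_{x'}[w(x') y' K(x_i, x')] for each point, with K the inner product."""
    return y * ((X @ X.T) @ (p * w * y))

gamma, eps = 0.3, 0.05
X, y, p = four_point_problem(gamma, eps)
kernel_margins = y * X[:, 0]              # margins under beta = (1, 0, 0)

rng = np.random.default_rng(3)
worst = max(similarity_margins(X, y, p, rng.uniform(0, 1, 4))[[1, 2]].min()
            for _ in range(1000))         # best achieved min-margin over x_2 and x_3
bound = 4 * eps * gamma**2
```

With these values the kernel margin is $\gamma = 0.3$ on every point, while the similarity margin on the worse of $x_2, x_3$ never exceeds $4\epsilon\gamma^2 = 0.018$.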


Proof of Theorem 3.3.10: Tightness for the Hinge Loss

In the above example, suppose we would like to get an average hinge-loss relative to margin $\gamma_1$ of at most $\epsilon_1$:

$$E_{x,y}\big[[1 - y\,E_{x',y'}[w(x')y'K(x, x')]/\gamma_1]_+\big] \le \epsilon_1 \qquad (3.35)$$

Following the arguments above, equation (3.34) can be used to bound the hinge-loss on at least one of the points $x_2$ or $x_3$, which, multiplied by the probability $\epsilon$ of the point, is a bound on the average hinge loss:

$$E_{x,y}\big[[1 - y\,E_{x',y'}[w(x')y'K(x, x')]/\gamma_1]_+\big] \ge \epsilon(1 - 4\epsilon\gamma^2/\gamma_1) \qquad (3.36)$$

and so to get an average hinge-loss of at most $\epsilon_1$ we must have:

$$\gamma_1 \le \frac{4\epsilon\gamma^2}{1 - \epsilon_1/\epsilon} \qquad (3.37)$$

For any target hinge-loss $\epsilon_1$, consider a distribution with $\epsilon = 2\epsilon_1$, in which case we get that the maximum margin attaining average hinge-loss $\epsilon_1$ is $\gamma_1 = 16\epsilon_1\gamma^2$, even though we can get a hinge loss of zero at margin $\gamma$ using a kernel. This establishes Theorem 3.3.10.




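Note: One might object that the example used in Theorems 3.3.9 and 3.3.10 is a bit artificial, since $K$ has margin $O(\gamma^2)$ in the similarity sense just because $1 - 4\gamma^2 \le K(x_i, x_j) \le 1$. Normalizing $K$ to $[-1, 1]$ we would obtain a similarity function that has margin $O(1)$. However, this "problem" can be simply fixed by adding the symmetric points on the lower semi-sphere:

$$\begin{aligned}
x_5 &= (\gamma, \gamma, -\sqrt{1 - 2\gamma^2}), & y_5 &= 1, & p_5 &= \tfrac{1}{4} - \epsilon \\
x_6 &= (\gamma, -\gamma, -\sqrt{1 - 2\gamma^2}), & y_6 &= 1, & p_6 &= \epsilon \\
x_7 &= (-\gamma, \gamma, -\sqrt{1 - 2\gamma^2}), & y_7 &= -1, & p_7 &= \epsilon \\
x_8 &= (-\gamma, -\gamma, -\sqrt{1 - 2\gamma^2}), & y_8 &= -1, & p_8 &= \tfrac{1}{4} - \epsilon
\end{aligned}$$

and by changing $p_1 = \tfrac{1}{4} - \epsilon$ and $p_4 = \tfrac{1}{4} - \epsilon$. The classification margins on $x_2$ and $x_3$ are now (compare with (3.33)):

$$\begin{aligned}
\mathbb{E}[w(x)yK(x_2,x)] &= \Big(\big(\tfrac{1}{4}-\epsilon\big)(w_1 - w_4 - w_5 + w_8) + \epsilon(w_2 - w_3 - w_6 + w_7)\Big)(1 - 2\gamma^2) \\
&\quad + 2\epsilon(w_2 + w_3 + w_6 + w_7)\gamma^2 \\
-\mathbb{E}[w(x)yK(x_3,x)] &= -\Big(\big(\tfrac{1}{4}-\epsilon\big)(w_1 - w_4 - w_5 + w_8) + \epsilon(w_2 - w_3 - w_6 + w_7)\Big)(1 - 2\gamma^2) \\
&\quad + 2\epsilon(w_2 + w_3 + w_6 + w_7)\gamma^2
\end{aligned}$$

One of the above classification margins must therefore be at most $2\epsilon(w_2 + w_3 + w_6 + w_7)\gamma^2 \le 8\epsilon\gamma^2$. And so, even though the similarity is "normalized", and is $(0, \gamma)$-kernel-good, it is only $(\epsilon, 8\epsilon\gamma^2)$-good as a similarity function. Proceeding as in the proof of Theorem 3.3.10 establishes the modified example is also only $(\epsilon, 64\epsilon\gamma^2)$-good in hinge loss.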
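The bound on the modified example can be checked numerically. The sketch below is an illustration only: it assumes the upper-hemisphere points $x_1,\ldots,x_4$ are the mirror images of $x_5,\ldots,x_8$ (they are defined earlier in the chapter, outside this passage), takes $K$ to be the inner product, and uses arbitrary weights $w_i \in [0,1]$. All function names are hypothetical.

```python
import math
import random

def build_points(gamma, eps):
    """Eight-point example as (x, y, p) triples.

    Assumption: x1..x4 are the upper-hemisphere mirrors of x5..x8.
    """
    s = math.sqrt(1.0 - 2.0 * gamma ** 2)
    # (sign of 1st coordinate, sign of 2nd coordinate, label)
    signs = [(1, 1, 1), (1, -1, 1), (-1, 1, -1), (-1, -1, -1)]
    probs = [0.25 - eps, eps, eps, 0.25 - eps]
    points = []
    for z in (s, -s):  # x1..x4 first, then x5..x8
        for (sa, sb, y), p in zip(signs, probs):
            points.append(((sa * gamma, sb * gamma, z), y, p))
    return points

def kernel(u, v):
    """Inner-product kernel on unit vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin(points, w, i):
    """Classification margin y_i * E[w(x) y K(x_i, x)] of point i."""
    xi, yi, _ = points[i]
    return yi * sum(p * wj * yj * kernel(xi, xj)
                    for (xj, yj, p), wj in zip(points, w))

def worst_of_x2_x3(gamma, eps, w):
    """Smaller of the two margins on x2 and x3 (list indices 1 and 2)."""
    pts = build_points(gamma, eps)
    return min(margin(pts, w, 1), margin(pts, w, 2))

if __name__ == "__main__":
    rng = random.Random(0)
    gamma, eps = 0.1, 0.05
    w = [rng.random() for _ in range(8)]
    print(worst_of_x2_x3(gamma, eps, w), "<=", 8 * eps * gamma ** 2)
```

Whatever weights are drawn, the smaller of the two margins stays below $8\epsilon\gamma^2$, because the two margins have the form $A + B$ and $-A + B$ with $B = 2\epsilon(w_2 + w_3 + w_6 + w_7)\gamma^2$.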


Probabilistic  Labels 

So far, we have considered only learning problems where the label $y$ is a deterministic function of $x$. Here, we discuss the necessary modifications to extend our theory also to noisy learning problems, where the same point $x$ might be associated with both positive and negative labels with positive probabilities.

Although the learning guarantees are valid also for noisy learning problems, a kernel that is kernel-good for a noisy learning problem might not be good as a similarity function for this learning problem. To amend this, the definition of a good similarity function must be corrected, allowing the weights to depend not only on the point $x$ but also on the label $y$:

Definition 3.3.8 (Main, Margin Violations, Corrected for Noisy Problems) A similarity function $K$ is an $(\epsilon, \gamma)$-good similarity function for a learning problem $P$ if there exists a bounded weighting function $w$ over $X \times \{-1, +1\}$ ($w(x', y') \in [0, 1]$ for all $x' \in X$, $y' \in \{-1, +1\}$) such that at least a $1 - \epsilon$ probability mass of examples $x, y$ satisfy:

$$\mathbb{E}_{x',y' \sim P}[yy'w(x',y')K(x,x')] \ge \gamma. \tag{3.38}$$

It is easy to verify that Theorem 3.3.3 can be extended also to this corrected definition. The same mapping $\phi^S$ can be used, with $\beta_i = y_i w(x_i, y_i)$, where $y_i$ is the training label of example $i$. Definition 3.3.6 and Theorem 3.3.4 can be extended in a similar way.

With these modified definitions, Theorems 3.3.7 and 3.3.8 extend also to noisy learning problems. In the proof of Theorem 3.3.8, two of the points $x_i, x_j$ might be identical, but have different labels $y_i = 1$, $y_j = -1$ associated with them. This might lead to two different weights $w_i, w_j$ for the same point. But since $w$ is now allowed to depend also on the label, this does not pose a problem. In the non-discrete case, this corresponds to defining the weight as:

$$w(x, y) = \mathbb{E}[\alpha^* \mid x, y] / C. \tag{3.39}$$

3.4  Learning  with  More  General  Similarity  Functions:  A  Better  Definition 

We develop here a new notion of a good similarity function that broadens the Balcan-Blum'06 notion [24] presented in Section 3.3 while still guaranteeing learnability. As with the Balcan-Blum'06 notion, this new definition talks in terms of natural similarity-based properties and does not require positive semi-definiteness or reference to implicit spaces. However, this new notion improves on the previous Balcan-Blum'06 definition in two important respects.

First, this new notion provides a better kernel-to-similarity translation. Any large-margin kernel function is a good similarity function under the new definition, and while we still incur some loss in the parameters, this loss is much smaller than under the prior definition, especially in terms of the final labeled sample-complexity bounds. In particular, when using a valid kernel function as a similarity function, a substantial portion of the previous sample-complexity bound can be transferred over to merely a need for unlabeled examples.

Second, we show that the new definition allows for good similarity functions to exist for concept classes for which there is no good kernel. In particular, for any concept class $C$ and sufficiently unconcentrated distribution $D$, we show there exists a similarity function under our definition with parameters yielding a labeled sample complexity bound of $O(\frac{1}{\epsilon}\log|C|)$ to achieve error $\epsilon$, matching the ideal sample complexity for a generic hypothesis class. In fact, we also extend this result to classes of finite VC-dimension rather than finite cardinality. In contrast, we show there exist classes $C$ such that under the uniform distribution over the instance space, there is no kernel with margin $8/\sqrt{|C|}$ for all $f \in C$ even if one allows 0.5 average hinge-loss. Thus, the margin-based guarantee on sample complexity for learning such classes with kernels is $\Omega(|C|)$. This extends work of [50] and [109] who give hardness results with comparable margin bounds, but at much lower error rates. [209] provide lower bounds for kernels with similar error rates, but their results hold only for regression (not hinge loss). Note that given access to unlabeled data, any similarity function under the Balcan-Blum'06 definition [24] can be converted to a kernel function with approximately the same parameters. Thus, our lower bound for kernel functions applies to that definition as well. These results establish a gap in the representational power of similarity functions under our new definition relative to the representational power of either kernels or similarity functions under the old definition.

Both this new definition and the Balcan-Blum'06 definition are based on the idea of a similarity function being good for a learning problem if there exists a non-negligible subset $R$ of "representative points" such that most examples $x$ are on average more similar to the representative points of their own label than to the representative points of the other label. (Formally, the "representativeness" of an example may be given by a weight between 0 and 1 and viewed as probabilistic or fractional.) However, the previous Balcan-Blum'06 definition combined the two quantities of interest (the probability mass of representative points and the gap in average similarity to representative points of each label) into a single margin parameter. The new notion keeps these quantities distinct, which turns out to make a substantial difference both in terms of broadness of applicability and in terms of the labeled sample complexity bounds that result.

Note that we distinguish between labeled and unlabeled sample complexities: while the total number of examples needed depends polynomially on the two quantities of interest, the number of labeled examples will turn out to depend only logarithmically on the probability mass of the representative set and therefore may be much smaller under the new definition. This is especially beneficial in situations as described in Chapter 2 in which unlabeled data is plentiful but labeled data is scarce, or the distribution is known and so unlabeled data is free. We discuss in detail the relation to the model in Chapter 2 in Section 3.5.

Another way to view the distinction between the two notions of similarity is that we now require good predictions using a weight function with expectation bounded by 1, rather than supremum bounded by 1: compare the old Definition 3.3.5 and the variant of the new definition given as Definition 3.4.4. (We do in fact still have a bound on the supremum which is much larger, but this bound only affects the labeled sample complexity logarithmically.) In Theorem 3.4.13 we make the connection between the two versions of the new definition explicit.

Conditioning on a subset of representative points, or equivalently bounding the expectation of the weight function, allows us to base our learnability results on $L_1$-regularized linear learning. The actual learning rule we get, given in Equation (3.49), is very similar, and even identical, to learning rules suggested by various authors and commonly used in practice as an alternative to Support Vector Machines [54, 127, 185, 192, 198]. Here we give a firm theoretical basis to this learning rule, with explicit learning guarantees, and relate it to simple and intuitive properties of the similarity function or kernel used (see the discussion at the end of Section 3.4.2).

3.4.1  New  Notions  of  Good  Similarity  Functions 

In this section we provide new notions of good similarity functions generalizing the main definitions in Section 3.3 (Definitions 3.3.5 and 3.3.6) that we prove have a number of important advantages. For simplicity in presentation, for most of this section we will consider only learning problems where the label $y$ is a deterministic function of $x$. For such learning problems, we can use $y(x)$ to denote the label of point $x$.

In Definitions 3.3.5 and 3.3.6 in Section 3.3, a weight $w(x') \in [0, 1]$ was used in defining the quantity of interest, namely $\mathbb{E}_{(x',y') \sim P}[y'w(x')K(x,x')]$. Here, it will instead be more convenient to think of $w(x)$ as the expected value of an indicator random variable $R(x) \in \{0, 1\}$ where we will view the (probabilistic) set $\{x : R(x) = 1\}$ as a set of "representative points". Formally, for each $x \in X$, $R(x)$ is a discrete random variable over $\{0, 1\}$ and we will then be sampling from the joint distribution of the form

$$\Pr(x, y, r) = \Pr(x, y)\,\Pr(R(x) = r) \tag{3.40}$$

in the discrete case or

$$p(x, y, r) = p(x, y)\,\Pr(R(x) = r) \tag{3.41}$$

in the continuous case, where $p$ is a probability density function of $P$.

Our new definition is now as follows.

Definition 3.4.1 (Main, Margin Violations) A similarity function $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function for a learning problem $P$ if there exists an extended distribution $P(x, y, r)$ defined as in (3.40) or (3.41) such that the following conditions hold:

1. A $1 - \epsilon$ probability mass of examples $(x, y) \sim P$ satisfy

$$\mathbb{E}_{(x',y',r') \sim P}[yy'K(x,x') \mid r' = 1] \ge \gamma \tag{3.42}$$

2. $\Pr_{(x',y',r')}[r' = 1] \ge \tau$.

If the representative set $R$ is 50/50 positive and negative (i.e., $\Pr_{(x',y',r')}[y' = 1 \mid r' = 1] = 1/2$), we can interpret the condition as stating that most examples $x$ are on average $2\gamma$ more similar to random representative examples $x'$ of their own label than to random representative examples $x'$ of the other label. The second condition is that at least a $\tau$ fraction of the points should be representative.
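As an illustration of how the two conditions of Definition 3.4.1 could be checked on data, the sketch below estimates the violation rate and the representative mass on a finite labeled sample. It is a plug-in estimate under stated assumptions, not a procedure from the text: the representative indicators `reps` are taken as given (one fixed draw of $R(x_i)$ per point), and the function name is hypothetical.

```python
def empirical_goodness(xs, ys, reps, K, gamma):
    """Estimate (eps, tau) at margin gamma on a finite labeled sample.

    xs, ys: points and their labels in {-1, +1}.
    reps:   reps[i] is 1 if xs[i] is treated as representative, else 0.
    K:      similarity function with values in [-1, 1].
    Returns (fraction of points violating the margin, fraction representative).
    """
    rep_idx = [j for j, r in enumerate(reps) if r == 1]
    if not rep_idx:
        raise ValueError("need at least one representative point")
    tau_hat = len(rep_idx) / len(xs)
    violations = 0
    for x, y in zip(xs, ys):
        # Empirical version of E[y y' K(x, x') | r' = 1]
        avg = sum(y * ys[j] * K(x, xs[j]) for j in rep_idx) / len(rep_idx)
        if avg < gamma:
            violations += 1
    return violations / len(xs), tau_hat
```

For example, with points on the real line labeled by sign and $K(a, b) = \pm 1$ according to whether $a$ and $b$ share a sign, every point has average margin 1 regardless of which points are marked representative.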

We  also  consider  a  hinge-loss  version  of  the  definition: 


Definition 3.4.2 (Main, Hinge Loss) A similarity function $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem $P$ if there exists an extended distribution $P(x, y, r)$ defined as in (3.40) or (3.41) such that the following conditions hold:

1. We have

$$\mathbb{E}_{(x,y) \sim P}\big[[1 - y\,g(x)/\gamma]_+\big] \le \epsilon, \tag{3.43}$$

where $g(x) = \mathbb{E}_{(x',y',r')}[y'K(x,x') \mid r' = 1]$.

2. $\Pr_{(x',y',r')}[r' = 1] \ge \tau$.
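The hinge-loss condition (3.43) can be estimated on a finite sample in the same plug-in fashion. The sketch below is illustrative only; it assumes the representative indicators are given and that $g(x)$ is approximated by an average over the representative points in the sample. The function name is hypothetical.

```python
def empirical_hinge_loss(xs, ys, reps, K, gamma):
    """Average of [1 - y * g(x) / gamma]_+ over the sample.

    g(x) is approximated by the mean of y' * K(x, x') over the points
    marked representative (reps[j] == 1).
    """
    rep_idx = [j for j, r in enumerate(reps) if r == 1]
    if not rep_idx:
        raise ValueError("need at least one representative point")
    total = 0.0
    for x, y in zip(xs, ys):
        g = sum(ys[j] * K(x, xs[j]) for j in rep_idx) / len(rep_idx)
        total += max(0.0, 1.0 - y * g / gamma)
    return total / len(xs)
```

Unlike the margin-violation count, this quantity degrades gradually: a point whose value $y\,g(x)$ falls just short of $\gamma$ contributes only a small loss.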

It is not hard to see that an $(\epsilon, \gamma)$-good similarity function under Definitions 3.3.5 and 3.3.6 is also an $(\epsilon, \gamma, \gamma)$-good similarity function under Definitions 3.4.1 and 3.4.2, respectively. In the reverse direction, an $(\epsilon, \gamma, \tau)$-good similarity function under Definitions 3.4.1 and 3.4.2 is an $(\epsilon, \gamma\tau)$-good similarity function under Definitions 3.3.5 and 3.3.6 (respectively). Specifically:

Theorem 3.4.1 If $K$ is an $(\epsilon, \gamma)$-good similarity function under Definitions 3.3.5 and 3.3.6, then $K$ is also an $(\epsilon, \gamma, \gamma)$-good similarity function under Definitions 3.4.1 and 3.4.2, respectively.

Proof: If we set $\Pr_{(x',y',r')}(r' = 1 \mid x') = w(x')$, we get that in order for any point $x$ to fulfill equation (3.6), we must have

$$\Pr_{(x',y',r')}(r' = 1) = \mathbb{E}_{x'}[w(x')] \ge \mathbb{E}_{(x',y')}[yy'w(x')K(x,x')] \ge \gamma.$$

Furthermore, for any $x, y$ for which (3.6) is satisfied, we have

$$\mathbb{E}_{(x',y',r')}[yy'K(x,x') \mid r' = 1] = \mathbb{E}_{(x',y')}[yy'K(x,x')w(x')] / \Pr_{(x',y',r')}(r' = 1) \ge \mathbb{E}_{(x',y')}[yy'K(x,x')w(x')] \ge \gamma.$$


Theorem 3.4.2 If $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function under Definitions 3.4.1 and 3.4.2, then $K$ is an $(\epsilon, \gamma\tau)$-good similarity function under Definitions 3.3.5 and 3.3.6 (respectively).

Proof: Setting $w(x') = \Pr_{(x',y',r')}(r' = 1 \mid x')$ we have for any $x, y$ satisfying (3.42) that

$$\mathbb{E}_{(x',y')}[yy'K(x,x')w(x')] = \mathbb{E}_{(x',y',r')}[yy'K(x,x')\,\mathbf{1}(r' = 1)] = \mathbb{E}_{(x',y',r')}[yy'K(x,x') \mid r' = 1]\,\Pr_{(x',y',r')}(r' = 1) \ge \gamma\tau.$$

A similar calculation establishes the correspondence for the hinge loss. ■

As we will see, under both old and new definitions, the number of labeled samples required for learning grows as $1/\gamma^2$. The key distinction between them is that we introduce a new parameter, $\tau$, that primarily affects the number of unlabeled examples required. This decoupling of the number of labeled and unlabeled examples enables us to handle a wider variety of situations with an improved labeled sample complexity. In particular, in translating from a kernel to a similarity function, we will find that much of the loss can now be placed into the $\tau$ parameter.




In the following we prove three types of results about this new notion of similarity. The first is that similarity functions satisfying these conditions are sufficient for learning (in polynomial time in the case of Definition 3.4.2), with a sample size of $O(\frac{1}{\gamma^2}\ln(\frac{1}{\gamma\tau}))$ labeled examples and $O(\frac{1}{\tau\gamma^2})$ unlabeled examples. This is particularly useful in settings where unlabeled data is plentiful and cheap (such settings are increasingly common in learning applications [82, 172]) or for distribution-specific learning where unlabeled data may be viewed as free.

The second main theorem we prove is that any class $C$, over a sufficiently unconcentrated distribution on examples, has a $(0, 1, 1/(2|C|))$-good similarity function (under either Definition 3.4.1 or 3.4.2), whereas there exist classes $C$ that have no $(0.5, 8/\sqrt{|C|})$-good kernel functions in hinge loss. This provides a clear separation between the similarity and kernel notions in terms of the parameters controlling labeled sample complexity. The final main theorem we prove is that any large-margin kernel function also satisfies our similarity definitions, with substantially less loss in the parameters controlling labeled sample complexity compared to the Balcan-Blum'06 definitions. For example, if $K$ is a $(0, \gamma)$-good kernel, then it is an $(\epsilon', \epsilon'\gamma^2)$-good similarity function under Definitions 3.3.5 and 3.3.6, and this is tight [195], resulting in a sample complexity of $\tilde{O}(1/(\gamma^4\epsilon^3))$ to achieve error $\epsilon$. However, we can show $K$ is an $(\epsilon', \gamma^2, \epsilon')$-good similarity function under the new definition,¹ resulting in a sample complexity of only $\tilde{O}(1/(\gamma^4\epsilon))$.

3.4.2  Good  Similarity  Functions  Allow  Learning 

The basic approach proposed for learning using a similarity function is similar to that in Section 3.3 and in [24]. First, a feature space is constructed, consisting of similarities to randomly chosen landmarks. Then, a linear predictor is sought in this feature space. However, for the previous Balcan-Blum'06 definitions (Definitions 3.3.5 and 3.3.6 in Section 3.3), we used guarantees for large $L_2$-margin in this feature space, whereas under the new definitions we will be using guarantees about large $L_1$-margin in the feature space.²

After recalling the notion of an $L_1$-margin and its associated learning guarantee, we first establish that, for an $(\epsilon, \gamma, \tau)$-good similarity function, the feature map constructed using $\tilde{O}(1/(\tau\gamma^2))$ landmarks indeed has (with high probability) a large $L_1$-margin separator. Using this result, we then obtain a learning guarantee by following the strategy outlined above.

In speaking of $L_1$-margin $\gamma$, we refer to separation with a margin $\gamma$ by a unit-$L_1$-norm linear separator, in a unit-$L_\infty$-bounded feature space. Formally, let $\phi : x \mapsto \phi(x)$, $\phi(x) \in \mathbb{R}^d$ with $\|\phi(x)\|_\infty \le 1$ be a mapping of the data to a $d$-dimensional feature space. We say that a linear predictor $\alpha \in \mathbb{R}^d$ achieves error $\epsilon$ relative to $L_1$-margin $\gamma$ if $\Pr_{(x,y(x))}(y(x)\langle\alpha, \phi(x)\rangle \ge \gamma) \ge 1 - \epsilon$ (this is the standard margin constraint) and $\|\alpha\|_1 = 1$.
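The definition of error relative to $L_1$-margin $\gamma$ translates directly into an empirical check. The following is a minimal sketch on a finite sample of already-mapped feature vectors; the function name and the numerical tolerances are choices made here for illustration.

```python
def l1_margin_error(alpha, features, labels, gamma):
    """Fraction of examples with y * <alpha, phi(x)> < gamma.

    Requires ||alpha||_1 = 1 and ||phi(x)||_inf <= 1, as in the definition.
    """
    l1_norm = sum(abs(a) for a in alpha)
    if abs(l1_norm - 1.0) > 1e-9:
        raise ValueError("alpha must have unit L1 norm")
    if any(abs(v) > 1.0 + 1e-9 for phi in features for v in phi):
        raise ValueError("features must satisfy ||phi(x)||_inf <= 1")
    violations = 0
    for phi, y in zip(features, labels):
        score = y * sum(a * v for a, v in zip(alpha, phi))
        if score < gamma:
            violations += 1
    return violations / len(labels)
```

Because $\|\alpha\|_1 = 1$ and $\|\phi(x)\|_\infty \le 1$, every score lies in $[-1, 1]$, so the margin $\gamma$ is meaningful only for $\gamma \le 1$.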

Given a $d$-dimensional feature map under which there exists some (unknown) zero-error linear separator with $L_1$-margin $\gamma$, we can with high probability $1 - \delta$ efficiently learn a predictor with error at most $\epsilon_{acc}$ using $O\left(\frac{\log(d/\delta)}{\epsilon_{acc}\gamma^2}\right)$ examples. This can be done using the Winnow algorithm with a standard online-to-batch conversion [163]. If we can only guarantee the existence of a separator with error $\epsilon > 0$ relative to $L_1$-margin $\gamma$, then a predictor with error $\epsilon + \epsilon_{acc}$ can be theoretically learned (with high probability $1 - \delta$) from a sample of $\tilde{O}\left(\log(d/\delta)/(\gamma^2\epsilon_{acc}^2)\right)$ examples by minimizing the number of $L_1$-margin $\gamma$ violations on the sample [211].

We are now ready to state the main result enabling learning using good similarity functions:

¹ Formally, the translation produces an $(\epsilon', \gamma^2/c, \epsilon' c)$-good similarity function for some $c \le 1$. However, smaller values of $c$ only improve the bounds.

² Note that in fact even for the previous Balcan-Blum'06 definitions we could have used guarantees for large $L_1$-margin in this feature space; however for the new definitions we cannot necessarily use guarantees about large $L_2$-margin in the feature space.




Theorem 3.4.3 Let $K$ be an $(\epsilon, \gamma, \tau)$-good similarity function for a learning problem $P$. Let $S = \{x'_1, x'_2, \ldots, x'_d\}$ be a (potentially unlabeled) sample of

$$d = \frac{2}{\tau}\left(\log(2/\delta) + 8\,\frac{\log(2/\delta)}{\gamma^2}\right)$$

landmarks drawn from $P$. Consider the mapping $\phi^S : X \to \mathbb{R}^d$ defined as follows: $\phi^S_i(x) = K(x, x'_i)$, $i \in \{1, \ldots, d\}$. Then, with probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$ has a separator of error at most $\epsilon + \delta$ relative to $L_1$ margin at least $\gamma/2$.

Proof: First, note that since $|K(x, x')| \le 1$ for all $x, x'$, we have $\|\phi^S(x)\|_\infty \le 1$.

For each landmark $x'_i$, let $r'_i$ be a draw from the distribution given by $R(x'_i)$. Consider the linear separator $\alpha \in \mathbb{R}^d$, given by $\alpha_i = y(x'_i)r'_i/d_1$ where $d_1 = \sum_i r'_i$ is the number of landmarks with $R(x') = 1$. This normalization ensures $\|\alpha\|_1 = 1$.

We have, for any $x, y(x)$:

$$y(x)\langle\alpha, \phi^S(x)\rangle = \frac{\sum_{i=1}^{d} y(x)y(x'_i)r'_iK(x, x'_i)}{d_1} \tag{3.44}$$

This is an empirical average of $d_1$ terms

$$-1 \le y(x)y(x')K(x, x') \le 1$$

for which $R(x') = 1$. For any $x$ we can apply Hoeffding's inequality, and obtain that with probability at least $1 - \delta^2/2$ over the choice of $S$, we have:

$$y(x)\langle\alpha, \phi^S(x)\rangle \ge \mathbb{E}_{x'}[K(x, x')y(x')y(x) \mid R(x') = 1] - \sqrt{\frac{2\log(2/\delta^2)}{d_1}} \tag{3.45}$$

Since the above holds for any $x$ with probability at least $1 - \delta^2/2$ over $S$, it also holds with probability at least $1 - \delta^2/2$ over the choice of $x$ and $S$. We can write this as:

$$\mathbb{E}_{S \sim P^d}\Big[\Pr_{x \sim P}(\text{violation})\Big] \le \delta^2/2 \tag{3.46}$$

where "violation" refers to violating (3.45). Applying Markov's inequality we get that with probability at least $1 - \delta/2$ over the choice of $S$, at most a $\delta$ fraction of points violate (3.45). Recalling Definition 3.4.1, at most an additional $\epsilon$ fraction of the points violate (3.42). But for the remaining $1 - \epsilon - \delta$ fraction of the points, for which both (3.45) and (3.42) hold, we have:

$$y(x)\langle\alpha, \phi^S(x)\rangle \ge \gamma - \sqrt{\frac{2\log(2/\delta^2)}{d_1}} \tag{3.47}$$

To bound the second term we need a lower bound on $d_1$, the number of representative landmarks. The probability of each of the $d$ landmarks being representative is at least $\tau$ and so the number of representative landmarks follows a Binomial distribution, ensuring $d_1 \ge 8\log(1/\delta)/\gamma^2$ with probability at least $1 - \delta/2$. When this happens, we have $\sqrt{2\log(2/\delta^2)/d_1} \le \gamma/2$. We get then, that with probability at least $1 - \delta$, for at least $1 - \epsilon - \delta$ of the points:

$$y(x)\langle\alpha, \phi^S(x)\rangle \ge \gamma/2. \tag{3.48}$$
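The construction used in Theorem 3.4.3 and its proof can be sketched in a few lines. The code below is an illustration, not the thesis's implementation: it takes the landmark count $d = \frac{2}{\tau}(\log(2/\delta) + 8\log(2/\delta)/\gamma^2)$ as read from the theorem statement, builds the map $\phi^S_i(x) = K(x, x'_i)$, and forms the separator $\alpha_i = y(x'_i)r'_i/d_1$ from the proof. The function names are hypothetical, and the separator is an analysis device: it needs the landmark labels and representative draws, which a learner does not observe.

```python
import math

def num_landmarks(tau, gamma, delta):
    """Landmark count d = (2/tau) * (log(2/delta) + 8*log(2/delta)/gamma^2), rounded up."""
    log_term = math.log(2.0 / delta)
    return math.ceil((2.0 / tau) * (log_term + 8.0 * log_term / gamma ** 2))

def feature_map(K, landmarks):
    """Return phi^S, mapping x to (K(x, x'_1), ..., K(x, x'_d))."""
    return lambda x: [K(x, lm) for lm in landmarks]

def separator(landmark_labels, landmark_reps):
    """alpha_i = y(x'_i) * r'_i / d_1, where d_1 counts representative landmarks."""
    d1 = sum(landmark_reps)
    if d1 == 0:
        raise ValueError("no representative landmark was drawn")
    return [y * r / d1 for y, r in zip(landmark_labels, landmark_reps)]
```

Only the feature map is needed at learning time; the existence of `separator` is what guarantees that a good $L_1$-margin predictor can be found in the mapped space.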


For the realizable ($\epsilon = 0$) case, we obtain:




Corollary 3.4.4 If $K$ is a $(0, \gamma, \tau)$-good similarity function then with high probability we can efficiently find a predictor with error at most $\epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\left(\frac{1}{\gamma^2\tau}\right)$ and from a labeled sample of size $d_l = \tilde{O}\left(\frac{\log d_u}{\gamma^2\epsilon_{acc}}\right)$.

Proof: We have proved in Theorem 3.4.3 that if $K$ is a $(0, \gamma, \tau)$-good similarity function, then with high probability there exists a low-error large-margin (at least $\gamma/2$) separator in the transformed space under mapping $\phi^S$. Thus, all we need now to learn well is to draw a new fresh sample, map it into the transformed space using $\phi^S$, and then apply a good algorithm for learning linear separators in the new space that produces a hypothesis of error at most $\epsilon_{acc}$ with high probability. In particular, remember that the vector $\alpha$ has error at most $\delta$ at $L_1$ margin $\gamma/2$ over $\phi^S(P)$, where the mapping $\phi^S$ produces examples of $L_\infty$ norm at most 1. In order to enjoy the better learning guarantees of the separable case, we will set $\delta_u$ small enough so that no bad points appear in the sample. Specifically, if we draw

$$d_u = d(\gamma, \delta_u, \tau) = \frac{2}{\tau}\left(\log(2/\delta_u) + 8\,\frac{\log(2/\delta_u)}{\gamma^2}\right)$$

unlabeled examples then with probability at least $1 - \delta_u$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^{d_u}$ has a separator of error at most $\delta_u$ relative to $L_1$ margin at least $\gamma/2$. So, if we draw $O\left(\frac{1}{\gamma^2\epsilon_{acc}}\ln(d_u/\delta)\right)$ new labeled examples then with high probability $1 - \delta_{final}$ these points are linearly separable at margin $\gamma/2$, where $\delta_{final} = c_1\frac{\delta_u}{\epsilon_{acc}\gamma^2}\ln(d_u/\delta)$, where $c_1$ is a constant.

Setting $\delta_u = \epsilon_{acc}\gamma^2\tau\delta/(c_2\ln(1/(\epsilon_{acc}\gamma\tau\delta)))$ (where $c_2$ is a constant) we get that with high probability $1 - \delta/2$ these points are linearly separable at margin $\gamma/2$ in the new feature space. The Corollary now follows from the $L_1$-margin learning guarantee in the separable case, discussed earlier in the section. ■

For the most general ($\epsilon > 0$) case, Theorem 3.4.3 implies that by following our two-stage approach, first using $d_u = \tilde{O}\left(\frac{1}{\gamma^2\tau}\right)$ unlabeled examples as landmarks in order to construct $\phi^S(\cdot)$, and then using a fresh sample of size $d_l = \tilde{O}\left(\frac{1}{\gamma^2\epsilon_{acc}^2}\ln d_u\right)$ to learn a low-error $L_1$-margin $\gamma$ separator in $\phi^S(\cdot)$, we have:

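Corollary 3.4.5 If $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function then by minimizing $L_1$ margin violations we can find a predictor with error at most $\epsilon + \epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\left(\frac{1}{\gamma^2\tau}\right)$ and from a labeled sample of size $d_l = \tilde{O}\left(\frac{\log d_u}{\gamma^2\epsilon_{acc}^2}\right)$.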

The procedure described above, although well defined, involves a difficult optimization problem: minimizing the number of $L_1$-margin violations. In order to obtain a computationally tractable procedure, we consider the hinge-loss instead of the margin error. In a feature space with $\|\phi(x)\|_\infty \le 1$ as above, we say that a unit-$L_1$-norm predictor $\alpha$, $\|\alpha\|_1 = 1$, has expected hinge-loss $\mathbb{E}\big[[1 - y(x)\langle\alpha, \phi(x)\rangle/\gamma]_+\big]$ relative to $L_1$-margin $\gamma$. Now, if we know there is some (unknown) predictor with hinge-loss $\epsilon$ relative to $L_1$-margin $\gamma$, then a predictor with error $\epsilon + \epsilon_{acc}$ can be learned (with high probability) from a sample of $\tilde{O}\left(\log d/(\gamma^2\epsilon_{acc}^2)\right)$ examples by minimizing the empirical average hinge-loss relative to $L_1$-margin $\gamma$ on the sample [211].

Before proceeding to discussing the optimization problem of minimizing the average hinge-loss relative to a fixed $L_1$-margin, let us establish the analogue of Theorem 3.4.3 for the hinge-loss:

Theorem 3.4.6 Assume that $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge-loss for a learning problem $P$. For any $\epsilon_1 > 0$ and $0 < \lambda < \gamma\epsilon_1/4$ let $S = \{x_1, x_2, \ldots, x_d\}$ be a sample of size $d = \frac{2}{\tau}\left(\log(2/\delta) + 16\log(2/\delta)/(\epsilon_1\gamma)^2\right)$ drawn from $P$. With probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^d$, for $\phi^S$ as defined in Theorem 3.4.3, has a separator achieving hinge-loss at most $\epsilon + \epsilon_1$ at margin $\gamma$.




Proof: We use the same construction as in Theorem 3.4.3. ■


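Corollary 3.4.7 If $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss then we can efficiently find a predictor with error at most $\epsilon + \epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\left(\frac{1}{\gamma^2\epsilon_{acc}^2\tau}\right)$ and from a labeled sample of size $d_l = \tilde{O}\left(\frac{\log d_u}{\gamma^2\epsilon_{acc}^2}\right)$.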


For the hinge-loss, our two stage procedure boils down to solving the following optimization problem w.r.t. $\alpha$:

$$\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{d_l}\Big[1 - \sum_{j=1}^{d_u}\alpha_j\,y(x_i)K(x_i, x'_j)\Big]_+ \\
\text{s.t.}\quad & \sum_{j=1}^{d_u}|\alpha_j| \le 1/\gamma
\end{aligned} \tag{3.49}$$

This is a linear program and can thus be solved in polynomial time, establishing the efficiency in Corollary 3.4.7.
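To see why the problem is a linear program, one standard reformulation is the following. It is a sketch: the slack variables $\xi_i$ and the split $\alpha_j = \alpha_j^+ - \alpha_j^-$ are notation introduced here, not taken from the text, and $x'_j$ denotes the $j$-th landmark.

```latex
\begin{aligned}
\underset{\alpha^+,\,\alpha^-,\,\xi}{\text{minimize}}\quad
  & \sum_{i=1}^{d_l} \xi_i \\
\text{s.t.}\quad
  & \xi_i \ge 1 - \sum_{j=1}^{d_u} (\alpha_j^+ - \alpha_j^-)\, y(x_i)\, K(x_i, x'_j),
  && i = 1, \dots, d_l, \\
  & \xi_i \ge 0, && i = 1, \dots, d_l, \\
  & \sum_{j=1}^{d_u} (\alpha_j^+ + \alpha_j^-) \le 1/\gamma, \\
  & \alpha_j^+ \ge 0,\quad \alpha_j^- \ge 0, && j = 1, \dots, d_u .
\end{aligned}
```

At an optimum each $\xi_i$ equals the corresponding hinge term, and since $|\alpha_j| \le \alpha_j^+ + \alpha_j^-$ with equality attainable, the feasible predictors are exactly those with $\|\alpha\|_1 \le 1/\gamma$. The program has $2d_u + d_l$ variables and linear constraints only.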

We can in fact use results in [115] to extend Corollary 3.4.7 a bit and get a better bound as follows:

Corollary 3.4.8 If $K$ is an $(\epsilon_{acc}/8, \gamma, \tau)$-good similarity function then with high probability we can efficiently find a predictor with error at most $\epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\left(\frac{1}{\gamma^2\tau}\right)$ and from a labeled sample of size $d_l = \tilde{O}\left(\frac{\log d_u}{\gamma^2\epsilon_{acc}}\right)$.

An optimization problem similar to (3.49), though usually with the same set of points used both as landmarks and as training examples, is actually fairly commonly used as a learning rule in practice [54, 127, 185]. Such a learning rule is typically discussed as an alternative to SVMs. In fact, [198] suggest the Relevance Vector Machine (RVM) as a Bayesian alternative to SVMs. The MAP estimate of the RVM is given by an optimization problem similar to (3.49), though with a loss function different from the hinge loss (the hinge-loss cannot be obtained as a log-likelihood). Similarly, [192] suggests Norm-Penalized Leveraging Procedures as a boosting-like approach that mimics SVMs. Again, although the specific loss functions studied by [192] are different from the hinge-loss, the method (with a norm exponent of 1, as in [192]'s experiments) otherwise corresponds to a coordinate-descent minimization of (3.49). In both cases, no learning guarantees are provided.

The motivation for using (3.49) as an alternative to SVMs is usually that the $L_1$-regularization on $\alpha$ leads to sparsity, and hence to "few support vectors" (although [207], who also discuss (3.49), argue for more direct ways of obtaining such sparsity), and also that the linear program (3.49) might be easier to solve than the SVM quadratic program. However, we are not aware of a previous discussion on how learning using (3.49) relates to learning using an SVM, or on learning guarantees using (3.49) in terms of properties of the similarity function $K$. Guarantees solely in terms of the feature space in which we seek low $L_1$-margin ($\phi^S$ in our notation) are problematic, as this feature space is generated randomly from data.

In fact, in order to enjoy the SVM guarantees while using $L_1$ regularization to obtain sparsity, some authors suggest regularizing both the $L_1$ norm $\|\alpha\|_1$ of the coefficient vector $\alpha$ (as in (3.49)), and the norm $\|\beta\|$ of the corresponding predictor $\beta = \sum_j \alpha_j\phi(x_j)$ in the Hilbert space implied by $K$, where $K(x, x') = \langle\phi(x), \phi(x')\rangle$, as when using an SVM with $K$ as a kernel [128, 179].

Here, we provide a natural condition on the similarity function $K$ (Definition 3.4.2), that justifies the learning rule (3.49). Furthermore, we show (in Section 3.4.4) that any similarity function that is good as a kernel, and can ensure SVM learning, is also good as a similarity function and can thus also ensure learning using the learning rule (3.49) (though possibly with some deterioration of the learning guarantees). These arguments can be used to justify (3.49) as an alternative to SVMs.

Before concluding this discussion, we would like to mention that [118] previously established a rather different connection between regularizing the $L_1$ norm $\|\alpha\|_1$ and regularizing the norm of the corresponding predictor $\beta$ in the implied Hilbert space. [118] considered a hard-margin SVR (Support Vector Regression Machine, i.e. requiring each prediction to be within $(y(x) - \epsilon, y(x) + \epsilon)$), in the noiseless case where the mapping $x \mapsto y(x)$ is in the Hilbert space. In this setting, [118] showed that a hard-margin SVR is equivalent to minimizing the distance in the implied Hilbert space between the correct mapping $x \mapsto y(x)$ and the predictions $x \mapsto \sum_j \alpha_j K(x, x_j)$, with an $L_1$ regularization term $\|\alpha\|_1$. However, this distance between prediction functions is very different than the objective in (3.49), and again refers back to the implied feature space which we are trying to avoid.

3.4.3  Separation  Results 

In  this  Section,  we  show  an  example  of  a  finite  concept  class  for  which  no  kernel  yields  good  learning 
guarantees  when  used  as  a  kernel,  but  for  which  there  does  exist  a  good  similarity  function  yielding  the 
optimal  sample  complexity.  That  is,  we  show  that  some  concept  classes  cannot  be  reasonably  represented 
by  kernels,  but  can  be  reasonably  represented  by  similarity  functions. 

Specifically, we consider a class $C$ of $n$ pairwise uncorrelated functions. This is a finite class of cardinality $|C| = n$, and so if the target belongs to $C$ then $O(\frac{1}{\epsilon}\log n)$ samples are enough for learning a predictor with error $\epsilon$.

Indeed, we show here that for any concept class $C$, so long as the distribution $D$ is sufficiently unconcentrated, there exists a similarity function that is $(0, 1, \frac{1}{2|C|})$-good under our definition for every $f \in C$. This yields a (labeled) sample complexity $O(\frac{1}{\epsilon}\log|C|)$ to achieve error $\epsilon$, matching the ideal sample complexity. In other words, for distribution-specific learning (where unlabeled data may be viewed as free) and finite classes, there is no intrinsic loss in sample-complexity incurred by choosing to learn via similarity functions. In fact, we also extend this result to classes of bounded VC-dimension rather than bounded cardinality.

In contrast, we show that if $C$ is a class of $n$ functions that are pairwise uncorrelated with respect to
distribution $D$, then no kernel is $(\epsilon, \gamma)$-good in hinge loss for all $f \in C$ even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$.
This extends work of [50, 109] who give hardness results with comparable margin bounds, but at a much
lower error rate. Thus, this shows there is an intrinsic loss incurred by using kernels together with margin
bounds, since this results in a sample complexity bound of at least $\Omega(|C|)$, rather than the ideal $O(\log |C|)$.

We thus demonstrate a gap between the kind of prior knowledge that can be represented with kernels
as opposed to general similarity functions, and demonstrate that similarity functions are strictly more
expressive (up to the degradation in parameters discussed earlier).

Definition 3.4.3 We say that a distribution $D$ over $X$ is $\alpha$-unconcentrated if the probability mass on any
given $x \in X$ is at most $\alpha$.

Theorem 3.4.9 For any finite class of functions $C$ and for any $1/|C|$-unconcentrated distribution
$D$ over the instance space $X$, there exists a similarity function $K$ that is a $(0, 1, \frac{1}{2|C|})$-good similarity
function for all $f \in C$.

Proof: Let $C = \{f_1, \ldots, f_n\}$. Now, let us partition $X$ into $n$ regions $R_i$ of at least $1/(2n)$ probability
mass each, which we can do since $D$ is $1/n$-unconcentrated. Finally, define $K(x, x')$ for $x'$ in $R_i$ to be
$f_i(x) f_i(x')$. We claim that for this similarity function, $R_i$ is a set of “representative points” establishing
margin $\gamma = 1$ for target $f_i$. Specifically,

$$\mathbf{E}[K(x, x') f_i(x) f_i(x') \mid x' \in R_i] = \mathbf{E}[f_i(x) f_i(x') f_i(x) f_i(x')] = 1.$$

Since $\Pr(R_i) \ge \frac{1}{2n}$, this implies that under distribution $D$, $K$ is a $(0, 1, \frac{1}{2n})$-good similarity function for
all $f_i \in C$.


Note 1: We can extend this argument to any class $C$ of small VC dimension. In particular, for any
distribution $D$, the class $C$ has an $\epsilon$-cover $C_\epsilon$ of size $(1/\epsilon)^{O(d)}$, where $d$ is the VC-dimension of $C$ [52].
By Theorem 3.4.9, we can have a $(0, 1, 1/|C_\epsilon|)$-good similarity function for the cover $C_\epsilon$, which in turn
implies an $(\epsilon, 1, 1/|C_\epsilon|)$-good similarity function for the original set (even in hinge loss since $\gamma = 1$).
Plugging in our bound on $|C_\epsilon|$, we get an $(\epsilon, 1, \epsilon^{O(d)})$-good similarity function for $C$. Thus, the labeled
sample complexity we get for learning with similarity functions is only $O((d/\epsilon) \log(1/\epsilon))$, and again there
is no intrinsic loss in sample complexity bounds due to learning with similarity functions.

Note 2: The need for the underlying distribution to be unconcentrated stems from our use of this
distribution for both labeled and unlabeled data. We could further extend our definition of “good similarity
function” to allow for the unlabeled points $x'$ to come from some other distribution $D'$ given a priori, such
as the uniform distribution over the instance space $X$. Now, the expectation over $x'$ and the probability
mass of $R$ would both be with respect to $D'$, and the generic learning algorithm would draw points $x'_i$
from $D'$ rather than $D$. In this case, we would only need $D'$ to be unconcentrated, rather than $D$.

We  now  prove  our  lower  bound  for  margin-based  learning  with  kernels. 

Theorem 3.4.10 Let $C$ be a class of $n$ pairwise uncorrelated functions over distribution $D$. Then, there
is no kernel that for all $f \in C$ is $(\epsilon, \gamma)$-good in hinge loss even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$.

Proof: Let $C = \{f_1, \ldots, f_n\}$. We begin with the basic Fourier setup [162, 167]. Given two functions
$f$ and $g$, define $\langle f, g \rangle = \mathbf{E}_x[f(x) g(x)]$ to be their correlation with respect to distribution $D$. (This is their
inner product if we view $f$ as a vector whose $j$th coordinate is $f(x_j)[D(x_j)]^{1/2}$.) Because the functions
$f_i \in C$ are pairwise uncorrelated, we have $\langle f_i, f_j \rangle = 0$ for all $i \ne j$, and because the $f_i$ are boolean
functions we have $\langle f_i, f_i \rangle = 1$ for all $i$. Thus they form at least part of an orthonormal basis, and for any
hypothesis $h$ (i.e. any mapping $X \to \{\pm 1\}$) we have

$$\sum_{f_i \in C} \langle h, f_i \rangle^2 \le 1.$$

So, by the Cauchy-Schwarz inequality, this implies

$$\sum_{f_i \in C} |\langle h, f_i \rangle| \le \sqrt{n},$$

or equivalently

$$\mathbf{E}_{f_i \in C} |\langle h, f_i \rangle| \le 1/\sqrt{n}. \tag{3.50}$$

In other words, for any hypothesis $h$, if we pick the target at random from $C$, the expected magnitude
of the correlation between $h$ and the target is at most $1/\sqrt{n}$.

We now consider the implications of having a good kernel. Suppose for contradiction that there exists
a kernel $K$ that is $(0.5, \gamma)$-good in hinge loss for every $f_i \in C$. What we will show is that this implies that
for any $f_i \in C$, the expected value of $|\langle h, f_i \rangle|$ for a random linear separator $h$ in the $\phi$-space is greater
than $\gamma/8$. If we can prove this, then we are done because this implies there must exist an $h$ that has
$\mathbf{E}_{f_i \in C} |\langle h, f_i \rangle| > \gamma/8$, which contradicts equation (3.50) for $\gamma = 8/\sqrt{n}$.

So, we just have to prove the statement about random linear separators. Let $w^*$ denote the vector
in the $\phi$-space that has hinge loss at most 0.5 at margin $\gamma$ for target function $f_i$. For any example $x$,
define $\gamma_x$ to be the margin of $\phi(x)$ with respect to $w^*$, and define $\alpha_x = \sin^{-1}(\gamma_x)$ to be the angular
margin of $\phi(x)$ with respect to $w^*$.* Now, consider choosing a random vector $h$ in the $\phi$-space, where
we associate $h(x) = \mathrm{sign}(h \cdot \phi(x))$. Since we only care about the absolute value $|\langle h, f_i \rangle|$, and since
$\langle -h, f_i \rangle = -\langle h, f_i \rangle$, it suffices to show that $\mathbf{E}_h[\langle h, f_i \rangle \mid h \cdot w^* \ge 0] > \gamma/8$. We do this as follows.

First, for any example $x$, we claim that:

$$\Pr_h[h(x) \ne f_i(x) \mid h \cdot w^* \ge 0] = 1/2 - \alpha_x/\pi. \tag{3.51}$$

This is because if we look at the 2-dimensional plane defined by $\phi(x)$ and $w^*$, and consider the half-circle
of $\|h\| = 1$ such that $h \cdot w^* \ge 0$, then (3.51) is the portion of the half-circle that labels $\phi(x)$ incorrectly.
Thus, we have:

$$\mathbf{E}_h[err(h) \mid h \cdot w^* \ge 0] = \mathbf{E}_x[1/2 - \alpha_x/\pi],$$

and so, using $\langle h, f_i \rangle = 1 - 2\,err(h)$, we have:

$$\mathbf{E}_h[\langle h, f_i \rangle \mid h \cdot w^* \ge 0] = 2\,\mathbf{E}_x[\alpha_x]/\pi.$$


Finally, we just need to relate angular margin and hinge loss: if $L_x$ is the hinge loss of $\phi(x)$, then a
crude bound on $\alpha_x$ is

$$\alpha_x \ge \gamma(1 - (\pi/2) L_x).$$

Since we assumed that $\mathbf{E}_x[L_x] \le 0.5$, we have:

$$\mathbf{E}_x[\alpha_x] \ge \gamma(1 - \pi/4).$$

Putting this together we get that the expected magnitude of correlation of a random halfspace is at least
$2\gamma(1 - \pi/4)/\pi > \gamma/8$ as desired, proving the theorem. ■
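The geometric claim (3.51) and the final constant can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original argument: it works directly in the 2-dimensional plane spanned by $\phi(x)$ and $w^*$, takes the target label of $x$ to be $+1$, and the function names and the number of trials are our own illustrative choices.

```python
import math
import random


def disagreement_rate(gamma_x, trials=200_000, seed=0):
    """Estimate Pr_h[sign(h . phi(x)) != +1 | h . w* >= 0] for a random direction h."""
    rng = random.Random(seed)
    w_star = (0.0, 1.0)  # unit vector playing the role of w*
    # unit vector phi(x) whose margin with respect to w* equals gamma_x
    phi = (math.sqrt(1.0 - gamma_x ** 2), gamma_x)
    wrong = kept = 0
    while kept < trials:
        theta = rng.uniform(0.0, 2.0 * math.pi)
        h = (math.cos(theta), math.sin(theta))  # uniformly random unit vector
        if h[0] * w_star[0] + h[1] * w_star[1] < 0:
            continue  # condition on h . w* >= 0
        kept += 1
        if h[0] * phi[0] + h[1] * phi[1] < 0:
            wrong += 1  # h assigns label -1 to phi(x)
    return wrong / kept


def predicted_rate(gamma_x):
    """Right-hand side of (3.51): 1/2 - alpha_x / pi with alpha_x = arcsin(gamma_x)."""
    return 0.5 - math.asin(gamma_x) / math.pi


# constant appearing in the last step of the proof: 2(1 - pi/4)/pi versus 1/8
constant = 2.0 * (1.0 - math.pi / 4.0) / math.pi
```

For a given margin, `disagreement_rate(g)` should be close to `predicted_rate(g)` up to sampling noise, and `constant` is slightly larger than 1/8, which is the slack used in the last inequality.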

An example of a class $C$ satisfying the above conditions is the class of parity functions over $\{0,1\}^{\lg n}$,
which are pairwise uncorrelated with respect to the uniform distribution. Note that the uniform
distribution is $1/|C|$-unconcentrated, and thus there is a good similarity function. (In particular, one could use
$K(x_i, x_j) = f_j(x_i) f_j(x_j)$, where $f_j$ is the parity function associated with indicator vector $x_j$.)
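As a small sanity check of this example, the sketch below enumerates all parity functions over $\{0,1\}^m$ for an illustrative $m = 3$ (so $n = 8$), verifies that they are pairwise uncorrelated under the uniform distribution, and evaluates the similarity function $K(x, x') = f_{x'}(x) f_{x'}(x')$ with the single representative point $x' = s$ for the target $f_s$. The helper names and the choice of $m$ are assumptions made only for this illustration.

```python
from itertools import product


def parity(s, x):
    """Parity with indicator vector s, written as a +1/-1 valued function of x."""
    return -1 if sum(si * xi for si, xi in zip(s, x)) % 2 else 1


def correlation(f, g, points):
    """<f, g> = E_x[f(x) g(x)] under the uniform distribution over `points`."""
    return sum(f(x) * g(x) for x in points) / len(points)


def similarity(x, x_prime):
    """K(x, x') = f_{x'}(x) f_{x'}(x'), where f_{x'} is the parity indexed by x'."""
    return parity(x_prime, x) * parity(x_prime, x_prime)


m = 3  # illustrative size: n = 2^m = 8 points and 8 parity functions
points = list(product([0, 1], repeat=m))
functions = {s: (lambda x, s=s: parity(s, x)) for s in points}

# pairwise correlations: 1 on the diagonal, 0 elsewhere
gram = {(s, t): correlation(functions[s], functions[t], points)
        for s in points for t in points}

# smallest value of K(x, s) f_s(x) f_s(s) over x, using x' = s as the representative point
margins = {
    s: min(similarity(x, s) * functions[s](x) * functions[s](s) for x in points)
    for s in points
}


def mean_abs_correlation(h):
    """E_{f in C} |<h, f>| for a hypothesis h, the quantity bounded in (3.50)."""
    return sum(abs(correlation(h, f, points)) for f in functions.values()) / len(functions)
```

Here every entry of `margins` equals 1, matching the margin $\gamma = 1$ of Theorem 3.4.9, while `mean_abs_correlation(h)` stays below $1/\sqrt{n}$ for any $\pm 1$-valued `h`, as (3.50) requires.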

We can extend Theorem 3.4.10 to classes of large Statistical Query dimension as well. In particular,
the SQ-dimension of a class $C$ with respect to distribution $D$ is the size $d$ of the largest set of functions
$\{f_1, f_2, \ldots, f_d\} \subseteq C$ such that $|\langle f_i, f_j \rangle| \le 1/d^3$ for all $i \ne j$ [63]. In this case, we just need to adjust
the Fourier analysis part of the argument to handle the fact that the functions may not be completely
uncorrelated.

Theorem 3.4.11 Let $C$ be a class of functions of SQ-dimension $d$ with respect to distribution $D$. Then,
there is no kernel that for all $f \in C$ is $(\epsilon, \gamma)$-good in hinge loss even for $\epsilon = 0.5$ and $\gamma = 16/\sqrt{d}$.

Proof: Let $f_1, \ldots, f_d$ be $d$ functions in $C$ such that $|\langle f_i, f_j \rangle| \le 1/d^3$ for all $i \ne j$. We can define
an orthogonal set of functions $f'_1, f'_2, \ldots, f'_d$ as follows: let $f'_1 = f_1$, $f'_2 = f_2 - f_1 \langle f_2, f_1 \rangle$, and in
general let $f'_i$ be the portion of $f_i$ orthogonal to the space spanned by $f_1, \ldots, f_{i-1}$. (That is, $f'_i = f_i -
\mathrm{proj}(f_i, \mathrm{span}(f_1, \ldots, f_{i-1}))$, where “proj” is orthogonal projection.) Since the $f'_i$ are orthogonal and have
length at most 1, for any boolean function $h$ we have $\sum_i \langle h, f'_i \rangle^2 \le 1$ and therefore $\mathbf{E}_i |\langle h, f'_i \rangle| \le 1/\sqrt{d}$.
Finally, since $\langle f_i, f_j \rangle \le 1/d^3$ for all $i \ne j$, one can show this implies that $|f_i - f'_i| \le 1/d$ for all $i$. So,
$\mathbf{E}_i |\langle h, f_i \rangle| \le 1/\sqrt{d} + 1/d \le 2/\sqrt{d}$. The rest of the argument in the proof of Theorem 3.4.10 now applies
with $\gamma = 16/\sqrt{d}$. ■

*So, $\alpha_x$ is a bit larger in magnitude than $\gamma_x$. This works in our favor when the margin is positive, and we just need to be
careful when the margin is negative.




For example, the class of size-$n$ decision trees over $\{0,1\}^n$ has $n^{\Omega(\log n)}$ pairwise uncorrelated
functions over the uniform distribution (in particular, any parity of $\log n$ variables can be written as an $n$-node
decision tree). So, this means we cannot have a kernel with margin $1/\mathrm{poly}(n)$ for all size-$n$ decision trees
over $\{0,1\}^n$. However, we can have a similarity function with margin 1, though the $\tau$ parameter (which
controls running time) will be exponentially small.

3.4.4  Relation  Between  Good  Kernels  and  Good  Similarity  Functions 

We start by showing that a kernel good as a similarity function is also good as a kernel. Specifically, if
a similarity function $K$ is indeed a kernel, and it is $(\epsilon, \gamma, \tau)$-good as a similarity function (possibly in
hinge loss), then it is also $(\epsilon, \gamma)$-good as a kernel (respectively, in hinge loss). That is, although the notion
of a good similarity function is more widely applicable, for those similarity functions that are positive
semidefinite, a good similarity function is also a good kernel.

Theorem 3.4.12 If $K$ is a valid kernel function, and is $(\epsilon, \gamma, \tau)$-good similarity for some learning
problem, then it is also $(\epsilon, \gamma)$-kernel-good for the learning problem. If $K$ is $(\epsilon, \gamma, \tau)$-good similarity in hinge
loss, then it is also $(\epsilon, \gamma)$-kernel-good in hinge loss.

Proof: Consider a similarity function $K$ that is a valid kernel, i.e. $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some
mapping $\phi$ of $x$ to a Hilbert space $\mathcal{H}$. For any input distribution and any probabilistic set of representative
points $R$ of the input we will construct a linear predictor $\beta_R \in \mathcal{H}$, with $\|\beta_R\| \le 1$, such that
similarity-based predictions using $R$ are the same as the linear predictions made with $\beta_R$.

Define the following linear predictor $\beta_R \in \mathcal{H}$:

$$\beta_R = \mathbf{E}_{(x',y',r')}[y' \phi(x') \mid r' = 1].$$

The predictor $\beta_R$ has norm at most:

$$\|\beta_R\| = \|\mathbf{E}_{(x',y',r')}[y' \phi(x') \mid r' = 1]\| \le \max_{x'} \|y(x') \phi(x')\| \le \max_{x'} \|\phi(x')\| = \max_{x'} \sqrt{K(x', x')} \le 1$$

where the second inequality follows from $|y(x')| \le 1$.

The predictions made by $\beta_R$ are:

$$\langle \beta_R, \phi(x) \rangle = \langle \mathbf{E}_{(x',y',r')}[y' \phi(x') \mid r' = 1], \phi(x) \rangle = \mathbf{E}_{(x',y',r')}[y' \langle \phi(x'), \phi(x) \rangle \mid r' = 1] = \mathbf{E}_{(x',y',r')}[y' K(x, x') \mid r' = 1].$$

That is, using $\beta_R$ is the same as using similarity-based prediction with $R$. In particular, the margin
violation rate, as well as the hinge loss, with respect to any margin $\gamma$, is the same for predictions made
using either $R$ or $\beta_R$. This is enough to establish Theorem 3.4.12: if $K$ is $(\epsilon, \gamma, \tau)$-good as a similarity
function (perhaps in hinge loss), there exists some valid $R$ that yields margin violation error rate (resp. hinge loss) at most $\epsilon$
with respect to margin $\gamma$, and so $\beta_R$ yields the same margin violation (resp. hinge loss) with respect to the
same margin, establishing that $K$ is $(\epsilon, \gamma)$-kernel-good (resp. for the hinge loss). ■
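The identity at the heart of this proof can be illustrated on a toy instance. The sketch below assumes the standard inner-product kernel (so the feature map is the identity) and a small hypothetical sample of triples $(x', y', r')$ weighted uniformly; it builds $\beta_R$ as the conditional average of $y' \phi(x')$ over the representative points and compares its linear predictions with the similarity-based predictions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_R(sample):
    """beta_R = E[y' phi(x') | r' = 1], with phi(x) = x (inner-product kernel)."""
    reps = [(x, y) for x, y, r in sample if r == 1]
    dim = len(reps[0][0])
    return [sum(y * x[k] for x, y in reps) / len(reps) for k in range(dim)]


def similarity_prediction(x, sample):
    """E[y' K(x, x') | r' = 1] with K(x, x') = <x, x'>."""
    reps = [(xp, y) for xp, y, r in sample if r == 1]
    return sum(y * dot(x, xp) for xp, y in reps) / len(reps)


# hypothetical toy sample of (x', y', r') triples; every x' has norm at most 1
sample = [
    ((0.6, 0.8), +1, 1),
    ((-0.8, 0.6), -1, 1),
    ((0.0, -1.0), -1, 0),
    ((1.0, 0.0), +1, 1),
]
beta = beta_R(sample)
```

For any query point `x`, `dot(beta, x)` and `similarity_prediction(x, sample)` agree up to floating-point error, and `dot(beta, beta)` does not exceed 1 because each representative point has norm at most 1.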

We now show the converse: if a kernel function is good in the kernel sense, it is also good in the
similarity sense, though with some degradation of the margin. This degradation is much smaller than the
one incurred previously by the Balcan - Blum’06 definitions (and the proofs in [24], [195], and [38]).
Specifically, we can show that if $K$ is a $(0, \gamma)$-good kernel, then $K$ is an $(\epsilon, \gamma^2, \epsilon)$-good similarity function
for any $\epsilon$ (formally, it is $(\epsilon, \gamma^2/c, \epsilon c)$-good for some $c \le 1$). The proof is based on the following idea.
Say we have a good kernel in hinge loss. Then we can choose an appropriate regularization parameter
and write a “distributional SVM” such that there exists a solution vector that gets a large fraction of the
distribution correct, and moreover, the fraction of support vectors is large enough. Any support vector
will then be considered a representative point in our similarity view, and the probability that a point is
representative is proportional to $\alpha_i/p_i$, where $\alpha_i$ is the dual variable associated with $x_i$.

To  formally  prove  the  desired  result,  we  introduce  an  intermediate  notion  of  a  good  similarity  function. 

Definition 3.4.4 (Intermediate, Margin Violations) A similarity function $K$ is a relaxed $(\epsilon, \gamma, M)$-good
similarity function for a learning problem $P$ if there exists a bounded weighting function $w$ over $X$,
$w(x') \in [0, M]$ for all $x' \in X$, $\mathbf{E}_{x' \sim P}[w(x')] \le 1$, such that at least a $1 - \epsilon$ probability mass of examples
$x$ satisfy:

$$\mathbf{E}_{x' \sim P}[y(x) y(x') w(x') K(x, x')] \ge \gamma. \tag{3.52}$$


Definition 3.4.5 (Intermediate, Hinge Loss) A similarity function $K$ is a relaxed $(\epsilon, \gamma, M)$-good
similarity function in hinge loss for a learning problem $P$ if there exists a weighting function $w(x') \in [0, M]$
for all $x' \in X$, $\mathbf{E}_{x' \sim P}[w(x')] \le 1$, such that

$$\mathbf{E}_x\left[[1 - y(x) g(x)/\gamma]_+\right] \le \epsilon, \tag{3.53}$$

where $g(x) = \mathbf{E}_{x' \sim P}[y(x') w(x') K(x, x')]$ is the similarity-based prediction made using $w(\cdot)$.

These intermediate definitions are closely related to our main similarity function definitions: in
particular, if $K$ is a relaxed $(\epsilon, \gamma, M)$-good similarity function for a learning problem $P$, then it is also an
$(\epsilon, \gamma/c, c/M)$-good similarity function for some $\gamma \le c \le 1$.

Theorem 3.4.13 If $K$ is a relaxed $(\epsilon, \gamma, M)$-good similarity function for a learning problem $P$, then
there exists $\gamma \le c \le 1$ such that $K$ is an $(\epsilon, \gamma/c, c/M)$-good similarity function for $P$. If $K$ is a relaxed
$(\epsilon, \gamma, M)$-good similarity function in hinge loss for $P$, then there exists $\gamma \le c \le 1$ such that $K$ is an
$(\epsilon, \gamma/c, c/M)$-good similarity function in hinge loss for $P$.

Proof: First, divide $w(x)$ by $M$ to scale its range to $[0, 1]$, so $\mathbf{E}[w] = c/M$ for some $c \le 1$ and the
margin is now $\gamma/M$. Define the random indicator $R(x')$ to equal 1 with probability $w(x')$ and 0 with
probability $1 - w(x')$, and let the extended probability $P$ over $X \times Y \times \{0, 1\}$ be defined as in Equations 3.40
or 3.41. We have

$$\tau = \Pr_{(x',y',r')}[r' = 1] = \mathbf{E}_{x'}[w(x')] = c/M,$$

and we can rewrite (3.52) as

$$\mathbf{E}_{(x',y',r')}[y(x) y' I(r' = 1) K(x, x')] \ge \gamma/M. \tag{3.54}$$

Finally, divide both sides of (3.54) by $\tau = c/M$, producing the conditional $\mathbf{E}_{(x',y',r')}[y(x) y(x') K(x, x') \mid
r' = 1]$ on the LHS and a margin of $\gamma/c$ on the RHS. The case of hinge loss is identical. ■

Note that since our guarantees for $(\epsilon, \gamma, \tau)$-good similarity functions depend on $\tau$ only through $\gamma^2 \tau$,
a decrease in $\tau$ and a proportional increase in $\gamma$ (as when $c < 1$ in Theorem 3.4.13) only improves
the guarantees. However, allowing flexibility in this tradeoff will make the kernel-to-similarity function
translation much easier.


We  will  now  establish  that  a  similarity  function  K  that  is  good  as  a  kernel,  is  also  good  as  a  similarity 
function  in  this  intermediate  sense,  and  hence,  by  Theorem  3.4.13,  also  in  our  original  sense.  We  begin 
by  considering  goodness  in  hinge-loss,  and  will  return  to  margin  violations  at  the  end  of  the  Section. 




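Theorem 3.4.14 If $K$ is an $(\epsilon_0, \gamma)$-good kernel in hinge loss for a learning problem (with deterministic labels),
then it is also a relaxed $(\epsilon_0 + \epsilon_1, \frac{\gamma^2}{1 + \epsilon_0/2\epsilon_1}, \frac{1}{2\epsilon_1 + \epsilon_0})$-good similarity in hinge loss for the learning problem,
for any $\epsilon_1 > 0$.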

Proof: We initially only consider finite discrete distributions, where:

$$\Pr(x_i, y_i) = p_i \tag{3.55}$$

for $i = 1 \ldots n$, with $\sum_{i=1}^n p_i = 1$ and $x_i \ne x_j$ for $i \ne j$.

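Let $K$ be any kernel function that is $(\epsilon_0, \gamma)$-kernel-good in hinge loss. Let $\phi$ be the implied feature
mapping and denote $\phi_i = \phi(x_i)$. Consider the following weighted-SVM quadratic optimization problem
with regularization parameter $C$:

$$\text{minimize} \quad \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n p_i [1 - y_i \langle \beta, \phi_i \rangle]_+ \tag{3.56}$$

The dual of this problem, with dual variables $\alpha_i$, is:

$$\text{maximize} \quad \sum_i \alpha_i - \frac{1}{2} \sum_{ij} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \qquad \text{subject to} \quad 0 \le \alpha_i \le C p_i \tag{3.57}$$

There is no duality gap, and furthermore the primal optimum $\beta^*$ can be expressed in terms of the dual
optimum $\alpha^*$: $\beta^* = \sum_i \alpha^*_i y_i \phi_i$.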

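Since $K$ is $(\epsilon_0, \gamma)$-kernel-good in hinge loss, there exists a predictor $\beta_0$, $\|\beta_0\| = 1$, with average hinge
loss $\epsilon_0$ relative to margin $\gamma$. The primal optimum $\beta^*$ of (3.56), being the optimum solution, then satisfies:

$$\frac{1}{2}\|\beta^*\|^2 + C \sum_i p_i [1 - y_i \langle \beta^*, \phi_i \rangle]_+ \le \frac{1}{2}\left\|\frac{1}{\gamma}\beta_0\right\|^2 + C \sum_i p_i \left[1 - y_i \left\langle \frac{1}{\gamma}\beta_0, \phi_i \right\rangle\right]_+ = \frac{1}{2\gamma^2} + C\,\mathbf{E}\left[\left[1 - y \left\langle \frac{1}{\gamma}\beta_0, \phi(x) \right\rangle\right]_+\right] = \frac{1}{2\gamma^2} + C\epsilon_0.$$

Since both terms on the left hand side are non-negative, each of them is bounded by the right hand side,
and in particular:

$$C \sum_i p_i [1 - y_i \langle \beta^*, \phi_i \rangle]_+ \le \frac{1}{2\gamma^2} + C\epsilon_0.$$

Dividing by $C$ we get a bound on the average hinge loss of the predictor $\beta^*$, relative to a margin of one:

$$\mathbf{E}\left[[1 - y \langle \beta^*, \phi(x) \rangle]_+\right] \le \frac{1}{2C\gamma^2} + \epsilon_0 \tag{3.58}$$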

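We now use the fact that $\beta^*$ can be written as $\beta^* = \sum_i \alpha^*_i y_i \phi_i$ with $0 \le \alpha^*_i \le C p_i$. Let us consider
the weights

$$w_i = w(x_i) = \alpha^*_i/(A p_i) \tag{3.59}$$

So, $w_i \le \frac{C}{A}$ and $\mathbf{E}[w] = \frac{\sum_i \alpha^*_i}{A}$. Furthermore, since we have no duality gap we also have

$$\sum_i \alpha^*_i - \frac{1}{2}\|\beta^*\|^2 = \frac{1}{2}\|\beta^*\|^2 + C \sum_i p_i [1 - y_i \langle \beta^*, \phi_i \rangle]_+,$$

so $\sum_i \alpha^*_i \le \frac{1}{\gamma^2} + C\epsilon_0$.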

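So, we have for every $x, y$:

$$y\,\mathbf{E}_{x',y'}[w(x') y' K(x, x')] = y \sum_i p_i w(x_i) y_i K(x, x_i) = y \sum_i p_i \alpha^*_i y_i K(x, x_i)/(A p_i) = y \sum_i \alpha^*_i y_i \langle \phi_i, \phi(x) \rangle / A = y \langle \beta^*, \phi(x) \rangle / A$$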


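Multiplying by $A$ and using (3.58):

$$\mathbf{E}_{x,y}\left[[1 - A y\,\mathbf{E}_{x',y'}[w(x') y' K(x, x')]]_+\right] = \mathbf{E}_{x,y}\left[[1 - y \langle \beta^*, \phi(x) \rangle]_+\right] \le \frac{1}{2C\gamma^2} + \epsilon_0 \tag{3.60}$$

Since $w_i \le \frac{C}{A}$, $\mathbf{E}[w] = \frac{\sum_i \alpha^*_i}{A}$, and $\sum_i \alpha^*_i \le \frac{1}{\gamma^2} + C\epsilon_0$, and we want $\mathbf{E}[w] \le 1$, we need to impose
that $\left(\frac{1}{\gamma^2} + C\epsilon_0\right)\frac{1}{A} \le 1$. We also want $w_i \in [0, M]$, so we also have the constraint $\frac{C}{A} \le M$. Choosing
$M = \frac{1}{2\epsilon_1 + \epsilon_0}$, $A = \frac{1 + \epsilon_0/2\epsilon_1}{\gamma^2}$, and $C = 1/(2\epsilon_1\gamma^2)$ we get an average hinge loss of $\epsilon_0 + \epsilon_1$ at margin $1/A$:

$$\mathbf{E}_{x,y}\left[[1 - y\,\mathbf{E}_{x',y'}[w(x') y' K(x, x')]/(1/A)]_+\right] \le \epsilon_0 + \epsilon_1 \tag{3.61}$$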


as desired. This establishes that if $K$ is an $(\epsilon_0, \gamma)$-good kernel in hinge loss then it is also a relaxed
$(\epsilon_0 + \epsilon_1, \frac{\gamma^2}{1 + \epsilon_0/2\epsilon_1}, \frac{1}{2\epsilon_1 + \epsilon_0})$-good similarity in hinge loss, for any $\epsilon_1 > 0$, at least for finite discrete distributions.

To extend the result also to non-discrete distributions, we can consider the variational “infinite SVM”
problem and apply the same arguments, as in [195] and in Section 3.3. ■
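The arithmetic behind the final choice of constants is easy to check mechanically. The sketch below reads the constants as $C = 1/(2\epsilon_1\gamma^2)$, $A = (1 + \epsilon_0/2\epsilon_1)/\gamma^2$ and $M = 1/(2\epsilon_1 + \epsilon_0)$, and evaluates the three requirements used in the proof: the hinge-loss bound from (3.60), the bound on $\mathbf{E}[w]$, and the cap $C/A \le M$ on the weights. The function names are ours and the numeric inputs are arbitrary illustrative values.

```python
def relaxed_parameters(eps0, eps1, gamma):
    """Constants chosen at the end of the proof of Theorem 3.4.14 (as read above)."""
    C = 1.0 / (2.0 * eps1 * gamma ** 2)           # SVM regularization parameter
    A = (1.0 + eps0 / (2.0 * eps1)) / gamma ** 2  # weight scaling; the margin is 1/A
    M = 1.0 / (2.0 * eps1 + eps0)                 # upper bound on the weights
    return C, A, M


def check(eps0, eps1, gamma, tol=1e-9):
    """Evaluate the quantities that the proof requires to be controlled."""
    C, A, M = relaxed_parameters(eps0, eps1, gamma)
    return {
        # right-hand side of (3.60); should equal eps0 + eps1
        "hinge_loss_bound": 1.0 / (2.0 * C * gamma ** 2) + eps0,
        # bound on E[w] = sum_i alpha_i / A; should be at most 1
        "expected_weight_bound": (1.0 / gamma ** 2 + C * eps0) / A,
        # each weight satisfies w_i <= C/A, which should be at most M
        "max_weight_ok": C / A <= M + tol,
        # resulting margin 1/A = gamma^2 / (1 + eps0 / (2 eps1))
        "margin": 1.0 / A,
    }


result = check(eps0=0.1, eps1=0.05, gamma=0.2)
```

With these illustrative inputs the hinge-loss bound comes out as $\epsilon_0 + \epsilon_1$, the bound on the expected weight is exactly 1, and the margin is $\gamma^2/(1 + \epsilon_0/2\epsilon_1)$, so both constraints are met with equality.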


Interpretation The proof of Theorem 3.4.14 shows the following. Assume that $K$ is a $(0, \gamma)$-good kernel.
Assume that $\tau$ is our desired error rate. Then we can choose a regularization parameter $C = 1/(2\gamma^2 \cdot \tau)$
for the “distributional SVM” (Eq. 3.56) such that there exists a solution vector that gets a $(1 - \tau)$ fraction
of the distribution correct, and moreover, the number of support vectors is at least a $\gamma^2 \cdot \tau$ fraction of the
whole distribution; so, we do end up spreading out the support vectors of the SVM in Eq. 3.56 a bit. Any
support vector will then be considered a representative point in our similarity view, and the probability
that a point is representative is proportional to $\alpha_i/p_i$.

Note however that if $K$ is a good kernel, then there might exist multiple different good sets of
representative points; the argument in Theorem 3.4.14 shows the existence of one such set based on an
SVM argument.^6

We can now use the hinge-loss correspondence to get a similar result for the margin-violation
definitions:

^6 In fact, the original proof that a good kernel is a good similarity function in the Balcan - Blum’06 sense, which appeared
in [24], was based on a different Perceptron-based argument.




Theorem 3.4.15 If $K$ is an $(\epsilon_0, \gamma)$-good kernel for a learning problem (with deterministic labels), then it is
also a relaxed $(\epsilon_0 + \epsilon_1, \gamma^2/2, \frac{1}{(1 - \epsilon_0)\epsilon_1})$-good similarity function for the learning problem, for any $\epsilon_1 > 0$.

Proof: If $K$ is $(0, \gamma)$-good as a kernel, it is also $(0, \gamma)$-good as a kernel in hinge loss, and we can apply
Theorem 3.4.14 to obtain that $K$ is also $(\epsilon_1/2, \gamma_1, \tau_1)$-good, where $\gamma_1 = \gamma^2$ and $\tau_1 = 1/\epsilon_1$. We can then
bound the number of margin violations at $\gamma_2 = \gamma_1/2$ by twice the hinge loss at margin $\gamma_1$ to obtain the
desired result.

If $K$ is only $(\epsilon, \gamma)$-good as a kernel, we follow a similar procedure to that described in [195] and in
Section 3.3, and consider a distribution conditioned only on those places where there is no error. Returning
to the original distribution, we must scale the weights up by an amount determined by the probability of
the event we conditioned on (i.e. the probability of no margin violation). This yields the desired bound.


Note: We also note that if we want our Definitions 3.4.1 and 3.4.2 to include the usual notions
of good kernel functions, we do need to allow the set $\{x : R(x) = 1\}$ to be probabilistic. To see this, let
us consider the following example, in which the point $x_i$ has label $y_i$ and probability $p_i$:

$x_1 = (\sqrt{1 - \gamma^2},\ \gamma)$,   $y_1 = 1$
$x_2 = (-\sqrt{1 - \gamma^2},\ \gamma)$,   $y_2 = 1$
$x_3 = (\sqrt{1 - \gamma^2},\ -\gamma)$,   $y_3 = -1$
$x_4 = (-\sqrt{1 - \gamma^2},\ -\gamma)$,   $y_4 = -1$

for some (small) $\gamma > 0$ and (small) probability $0 < \epsilon < \frac{1}{2}$. The four points are all on the unit
sphere (i.e. $\|x_i\| = 1$ and so $K(x_i, x_j) = \langle x_i, x_j \rangle \le 1$), and are clearly separated by $\beta = (0, 1)$ with a
margin of $\gamma$. The standard inner-product kernel is therefore $(0, \gamma)$-kernel-good on this distribution. Note
however that for any $\tau$, in order to get $K$ to be a $(0, \gamma, \tau)$-good similarity function we need to allow $R$
to be probabilistic. This can be easily verified by a case analysis. Clearly we cannot have $R$ contain just
one point. Also, we cannot have $R$ be only $\{x_1, x_4\}$ since $x_3$ will fail to satisfy the condition. Similarly with respect to
$\{x_2, x_3\}$. Other cases can be easily verified as well.
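Part of this case analysis can be replayed numerically. The sketch below assumes the four points are $(\pm\sqrt{1 - \gamma^2}, \pm\gamma)$ with labels given by the sign of the second coordinate, and uses hypothetical probabilities $p_i$ chosen only for illustration (the failures checked here do not depend on them). It computes, for a deterministic representative set $R$, the quantity $y(x)\,\mathbf{E}[y' K(x, x') \mid x' \in R]$ for the standard inner-product kernel.

```python
import math


def four_points(gamma):
    """The four unit-norm points of the example and their labels."""
    a = math.sqrt(1.0 - gamma ** 2)
    xs = [(a, gamma), (-a, gamma), (a, -gamma), (-a, -gamma)]
    ys = [1, 1, -1, -1]
    return xs, ys


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def deterministic_margin(i, R, xs, ys, p):
    """y_i * E[y' K(x_i, x') | x' in R] for a deterministic set R of point indices."""
    mass = sum(p[j] for j in R)
    return ys[i] * sum(p[j] * ys[j] * dot(xs[i], xs[j]) for j in R) / mass


gamma = 0.1
xs, ys = four_points(gamma)
p = [0.45, 0.05, 0.05, 0.45]  # hypothetical probabilities, for illustration only

# margin of each point under the linear separator beta = (0, 1)
kernel_margins = [ys[i] * dot((0.0, 1.0), xs[i]) for i in range(4)]

# R = {x_1, x_4} evaluated at x_3, and R = {x_2, x_3} evaluated at x_1 (0-based indices)
fail_14 = deterministic_margin(2, [0, 3], xs, ys, p)
fail_23 = deterministic_margin(0, [1, 2], xs, ys, p)
```

Every entry of `kernel_margins` equals $\gamma$, whereas `fail_14` and `fail_23` are both negative (equal to $-(1 - 2\gamma^2)$), so neither deterministic set can serve as a set of representative points at margin $\gamma$.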

One can use the same example in order to show that we need to consider $w(x) \in [0, 1]$ rather than
$w(x) \in \{0, 1\}$ in the context of Definitions 3.3.5 and 3.3.6.


3.4.5  Tightness

We show here that in fact we need to allow $O(\gamma^2)$ loss in the kernel-to-similarity translation. Specifically:

Theorem 3.4.16 (Tightness, Margin Violations) For any $\epsilon$, $\tau$, and $\gamma$ there exists a learning problem and
a kernel function $K$, which is $(0, \gamma)$-kernel-good for the learning problem, but which cannot be
$(\epsilon, \gamma', \tau)$-good similarity for $\gamma' > 2\gamma^2$.

Proof: Assume that $X \subseteq \mathbb{R}^d$, for sufficiently large $d$. Assume that $x_i$ has all coordinates 0 except for
coordinates 1 and $i$, which are set to $y(x_i)\gamma$ and $\sqrt{1 - \gamma^2}$, respectively. It is easy to verify that the standard inner-product
kernel is $(0, \gamma)$-kernel-good on this distribution: it is separated by $\beta = (1, 0, \ldots, 0)$ with a margin of
$\gamma$. We also clearly have $|K(x_i, x_j)| \le \gamma^2$ for all $i \ne j$ and $K(x, x) = 1$ for all $x$, which implies that
$\mathbf{E}_{(x',y',r') \sim P}[y y' K(x, x') \mid r' = 1] \le 2\gamma^2$ for any extended distribution $P(x, y, r)$. This then implies the
desired conclusion. ■
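The construction can be examined numerically for one particular choice of representative set. The sketch below is only an illustration under assumptions of our own: it uses 0-based indexing (coordinate 0 carries $y(x_i)\gamma$ and coordinate $i \ge 1$ carries $\sqrt{1 - \gamma^2}$), a uniform distribution over $m$ such points, and treats every point as representative. In that case the similarity margin of each point is $\gamma^2 + (1 - \gamma^2)/m$, which is at most $2\gamma^2$ once $m \ge 1/\gamma^2$.

```python
import math


def tightness_points(m, gamma, labels):
    """Point i (1..m) has y_i * gamma in coordinate 0 and sqrt(1 - gamma^2) in coordinate i."""
    pts = []
    for i, y in enumerate(labels, start=1):
        x = [0.0] * (m + 1)
        x[0] = y * gamma
        x[i] = math.sqrt(1.0 - gamma ** 2)
        pts.append(x)
    return pts


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity_margin(i, pts, labels):
    """y_i * E[y' K(x_i, x')] with x' uniform over all points, all of them representative."""
    m = len(pts)
    return labels[i] * sum(labels[j] * dot(pts[i], pts[j]) for j in range(m)) / m


gamma = 0.1
m = 120  # illustrative choice with m >= 1 / gamma^2 = 100
labels = [1 if i % 2 == 0 else -1 for i in range(m)]
pts = tightness_points(m, gamma, labels)

kernel_margins = [labels[i] * pts[i][0] for i in range(m)]           # margin under beta = e_0
largest_similarity_margin = max(similarity_margin(i, pts, labels) for i in range(m))
```

All entries of `kernel_margins` equal $\gamma$, while `largest_similarity_margin` is about $\gamma^2 + (1 - \gamma^2)/m$ and stays below $2\gamma^2$, in line with the $O(\gamma^2)$ loss claimed above for this example.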




Theorem 3.4.17 (Tightness, Hinge Loss) For any $\epsilon < \frac{1}{2}$, $\tau$, and $\gamma$ there exists a learning problem and a
kernel function $K$, which is $(0, \gamma)$-kernel-good in hinge loss for the learning problem, but which cannot
be $(\epsilon, \gamma', \tau)$-good similarity in hinge loss for $\gamma' \ge 4\gamma^2$.

Proof: The same example as in Theorem 3.4.16 gives us the desired conclusion.

Let $g(x) = \mathbf{E}_{(x',y',r')}[y' K(x, x') \mid r' = 1]$ be defined as in Definition 3.4.2. We clearly have $g(x) \in
[-2\gamma^2, 2\gamma^2]$. So, clearly for $\gamma' \ge 4\gamma^2$ we have $[1 - y(x) g(x)/\gamma']_+ \ge [1 - 2\gamma^2/(4\gamma^2)]_+ = 1/2$. This
then implies the desired conclusion. ■


3.4.6  Learning  with  Multiple  Similarity  Functions 

We consider here, as in Section 3.3.4, the case of learning with multiple similarity functions. Suppose that
rather than having a single similarity function, we were instead given $n$ functions $K_1, \ldots, K_n$, and our
hope is that some convex combination of them will satisfy Definition 3.4.1. Is this sufficient to be able to
learn well? The following generalization of Theorem 3.4.3 shows that this is indeed the case. (The analog
of Theorem 3.4.6 can be derived similarly.)

Theorem 3.4.18 Suppose $K_1, \ldots, K_n$ are similarity functions such that some (unknown) convex
combination of them is $(\epsilon, \gamma, \tau)$-good. For any $\delta > 0$, let $S = \{x'_1, x'_2, \ldots, x'_d\}$ be a sample of size
$d = \frac{16 \log(1/\delta)}{\tau\gamma^2}$ drawn from $P$. Consider the mapping $\phi^S : X \to \mathbb{R}^{nd}$ defined as follows:

$$\phi^S(x) = (K_1(x, x'_1), \ldots, K_n(x, x'_1), \ldots, K_1(x, x'_d), \ldots, K_n(x, x'_d)).$$

With probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^{nd}$ has
a separator of error at most $\epsilon + \delta$ at $L_1, L_\infty$ margin at least $\gamma/2$.

Proof: Let $K = \alpha_1 K_1 + \ldots + \alpha_n K_n$ be an $(\epsilon, \gamma, \tau)$-good convex combination of the $K_i$. By Theorem
3.4.3, had we instead performed the mapping $\hat{\phi}^S : X \to \mathbb{R}^d$ defined as

$$\hat{\phi}^S(x) = (K(x, x'_1), \ldots, K(x, x'_d)),$$

then with probability $1 - \delta$, the induced distribution $\hat{\phi}^S(P)$ in $\mathbb{R}^d$ would have a separator of error at most
$\epsilon + \delta$ at margin at least $\gamma/2$. Let $\hat{\beta}$ be the vector corresponding to such a separator in that space. Now, let us
convert $\hat{\beta}$ into a vector in $\mathbb{R}^{nd}$ by replacing each coordinate $\hat{\beta}_j$ with the $n$ values $(\alpha_1 \hat{\beta}_j, \ldots, \alpha_n \hat{\beta}_j)$. Call
the resulting vector $\beta$. Notice that by design, for any $x$ we have $\langle \beta, \phi^S(x) \rangle = \langle \hat{\beta}, \hat{\phi}^S(x) \rangle$. Furthermore,
$\|\beta\|_1 = \|\hat{\beta}\|_1$. Thus, the vector $\beta$ under distribution $\phi^S(P)$ has the same properties as the vector $\hat{\beta}$ under
$\hat{\phi}^S(P)$. This implies the desired result. ■
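The lifting step in this proof is mechanical and can be sketched directly. In the example below the two similarity functions, the convex weights, the landmark points and the separator $\hat{\beta}$ are all hypothetical values chosen for illustration; the code only demonstrates that expanding each coordinate $\hat{\beta}_j$ into $(\alpha_1 \hat{\beta}_j, \ldots, \alpha_n \hat{\beta}_j)$ preserves both the predictions and the $L_1$ norm.

```python
def lift(beta_hat, alpha):
    """Replace each coordinate beta_hat[j] by alpha[0]*beta_hat[j], ..., alpha[n-1]*beta_hat[j]."""
    return [a * b for b in beta_hat for a in alpha]


def phi_combined(x, landmarks, K):
    """Features under the single combined similarity function K."""
    return [K(x, xp) for xp in landmarks]


def phi_multi(x, landmarks, Ks):
    """Features (K_1(x,x'_1), ..., K_n(x,x'_1), ..., K_1(x,x'_d), ..., K_n(x,x'_d))."""
    return [Ki(x, xp) for xp in landmarks for Ki in Ks]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# two hypothetical bounded similarity functions on real numbers
def K1(x, y):
    return max(-1.0, min(1.0, x * y))


def K2(x, y):
    return 1.0 if (x >= 0) == (y >= 0) else -1.0


alpha = [0.25, 0.75]  # convex combination weights (non-negative, summing to 1)


def K(x, y):
    return alpha[0] * K1(x, y) + alpha[1] * K2(x, y)


landmarks = [-1.0, 0.5, 2.0]   # hypothetical sample S
beta_hat = [0.5, -0.2, 0.3]    # hypothetical separator in R^d
beta = lift(beta_hat, alpha)   # corresponding separator in R^{nd}
```

For every `x`, `dot(beta, phi_multi(x, landmarks, [K1, K2]))` equals `dot(beta_hat, phi_combined(x, landmarks, K))`, and the two vectors have the same $L_1$ norm because the weights $\alpha_i$ are non-negative and sum to one.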

Note that we get significantly better bounds here than in Section 3.3.4 and in [24], since the margin
does not drop by a factor of $1/\sqrt{n}$, because we use an $L_1$-based learning algorithm.

3.5  Connection  to  the  Semi-Supervised  Learning  Setting 

We discuss here how we can connect the framework in this chapter with the Semi-Supervised Learning
model in Chapter 2. The approach here does have a similar flavor to the approach in Chapter 2; however,
at a technical level, the final guarantees and learning procedures are somewhat different.

Given a similarity function $K$, let us define $C_K$ as the set of functions of the form

$$f_\alpha = \sum_{x_i \in X} \alpha(x_i) K(\cdot, x_i).$$




Clearly, in general $C_K$ may have infinite capacity.^7 Our assumptions on the similarity function, e.g., the
assumption in Definition 3.3.5, can be interpreted as saying that the target function has unlabeled error $\epsilon$ at
margin $\gamma$, where the unlabeled error rate of a function $f_\alpha$ specified by coefficients $\alpha(x_i)$ is defined as

$$err_{unl}(f_\alpha) = 1 - \chi(f_\alpha, P) = \Pr_x\left[\,|\mathbf{E}_{x'}[K(x, x')\alpha(x')]| \le \gamma\,\right].$$

Note that here we can define $\chi(f_\alpha, x) = 1$ if $|\mathbf{E}_{x'}[K(x, x')\alpha(x')]| > \gamma$ and 0 otherwise.

Let us define $P_\chi$ to be the distribution $P$ conditioned on $\chi(f_\alpha, x) = 1$, and let $d_\chi(f, g) = \Pr_{x \sim P_\chi}[f(x) \ne g(x)]$.
What we are effectively doing in Section 3.4 is the following. Given a fixed $\gamma$, we extract a $(\delta, \delta/2)$-randomized approximate
cover^8 of $C_K$ with respect to distance $d_\chi$. In particular, the guarantee we get is that for any function
$f_\alpha$, with probability at least $1 - \delta/2$, we can find a function $\tilde{f}_\alpha$ in the cover such that $d_\chi(f_\alpha, \tilde{f}_\alpha) \le \delta$.
Since $K$ is $(\epsilon, \gamma)$-good in the sense of Definition 3.3.5, it follows that there exists a function $f_\alpha$ such that
$err_{unl}(f_\alpha) + err_\chi(f_\alpha) \le \epsilon$, where

$$err_\chi(f_\alpha) = \Pr_{x \sim P_\chi}[f(x) \ne y(x)] \cdot \Pr_x[\chi(f, x) = 1].$$

X'^P^  X 

Since we extract a $(\delta, \delta/2)$-randomized approximate cover of $C_K$, it follows that with high probability, at
least $1 - \delta/2$, we can find a function $\tilde{f}_\alpha$ such that $err(\tilde{f}_\alpha) \le err_{unl}(f_\alpha) + err_\chi(f_\alpha) + \delta$. Once we have
constructed the randomized approximate cover, we then in a second stage use labeled examples to learn
well.

So, in the case studied in this chapter, the hypothesis space may have an infinite capacity before
performing the inference. In the training process, in a first stage, we use unlabeled data in order to extract
a much smaller set of functions with the property that with high probability the target is well approximated
by one of the functions in the smaller class. In a second stage we then use labeled examples to learn well.
(Note that our compatibility assumption implies an upper bound on the best labeled error we could hope
for.)

For the hinge loss Definition 3.3.6, we need to consider a cover according to the distance

$$d_\chi(f, g) = \mathbf{E}[\,|f(x) - g(x)|/\gamma\,].$$


3.6  Conclusions 

The  main  contribution  of  this  chapter  is  to  develop  a  theory  of  learning  with  similarity  functions — namely, 
of  when  a  similarity  function  is  good  for  a  given  learning  problem — that  is  more  general  and  in  terms  of 
more  tangible  quantities  than  the  standard  theory  of  kernel  functions.  We  provide  a  definition  that  we  show 
is  both  sufficient  for  learning  and  satisfied  by  the  usual  large-margin  notion  of  a  good  kernel.  Moreover, 
the  similarity  properties  we  consider  do  not  require  reference  to  implicit  high-dimensional  spaces  nor  do 
they  require  that  the  similarity  function  be  positive  semi-definite.  In  this  way,  we  provide  the  first  rigorous 
explanation  showing  why  a  kernel  function  that  is  good  in  the  large-margin  sense  can  also  formally  be 
viewed  as  a  good  similarity  function,  thereby  giving  formal  justification  to  the  standard  intuition  about 
kernels.  We  prove  that  our  main  notion  of  a  “good  similarity  function”  is  strictly  more  powerful  than 
the traditional notion of a large-margin kernel. This notion relies upon $L_1$-regularized learning, and our

Footnote: By capacity of a set of functions here we mean a distribution-independent notion of dimension of the given set of functions,
e.g., VC-dimension.

Footnote: Given a class of functions $C$, we define an $(\alpha, \beta)$-cover of $C$ with respect to distance $d$ to be a probability distribution over
sets of functions $\tilde{C}$ such that for any $f \in C$, with probability at least $1 - \alpha$, the randomly chosen $\tilde{C}$ from the distribution contains
$\tilde{f}$ such that $d(f, \tilde{f}) \le \beta$.




separation result is related to a separation result between what is learnable with $L_1$ vs. $L_2$ regularization.
In a lower bound of independent interest, we show that if $C$ is a class of $n$ pairwise uncorrelated functions,
then no kernel is $(\epsilon, \gamma)$-good in hinge-loss for all $f \in C$ even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$.

From a practical perspective, the results of Sections 3.3 and 3.4 suggest that if K is in fact a valid
kernel, we are probably better off using it as a kernel, e.g., in an SVM or Perceptron algorithm, rather
than going through the transformation of Section 3.3.3. However, faced with a non-positive-semidefinite
similarity function (coming from domain experts), the transformation of Theorem 3.3.3 might well be
useful. In fact, Liao and Noble have used an algorithm similar to the one we propose in the context of
protein classification [160]. Furthermore, a direct implication of our results is that we can indeed think (in
the design process) of the usefulness of a kernel function in terms of more intuitive, direct properties of
the data in the original representation, without need to refer to implicit spaces.

Finally, our algorithms (much like those of [31]) suggest a natural way to use kernels or other similarity
functions in learning problems for which one also wishes to use the native features of the examples. For
instance, consider the problem of classifying a stream of documents arriving one at a time. Rather than
running a kernelized learning algorithm, one can simply take the native features (say the words in the
document) and augment them with additional features representing the similarity of the current example
with each of a pre-selected set of initial documents. One can then feed the augmented example into a
standard unkernelized online learning algorithm. It would be interesting to explore this idea further.
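
The feature augmentation described above can be sketched as follows. This is only an illustrative sketch: the function names, the word-count features, and the Jaccard word-overlap similarity are assumptions made for the example and are not taken from the thesis; any similarity function with values in $[-1, 1]$ could be substituted.

```python
from typing import Callable, Dict, Sequence


def word_counts(doc: str) -> Dict[str, float]:
    """Native features: bag-of-words counts (an illustrative choice)."""
    counts: Dict[str, float] = {}
    for word in doc.lower().split():
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def jaccard(a: str, b: str) -> float:
    """Toy similarity function (hypothetical): overlap of the two word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def augment_with_similarities(
    native: Dict[str, float],
    example: str,
    landmarks: Sequence[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the native features plus one similarity feature per landmark."""
    augmented = dict(native)
    for i, landmark in enumerate(landmarks):
        augmented[f"sim_to_landmark_{i}"] = similarity(example, landmark)
    return augmented


# Pre-selected initial documents (made-up data for the example).
landmarks = ["the pitcher threw a strike", "the proof uses a reduction"]
doc = "the pitcher threw a curve"
features = augment_with_similarities(word_counts(doc), doc, landmarks, jaccard)
```

The resulting dictionary can be passed to any online learner that accepts sparse feature vectors; the learner itself never needs to know that some coordinates came from a similarity function.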

It would be interesting to explore whether the lower bound could be extended to cover margin
violations with a constant error rate $\epsilon > 0$ rather than only hinge-loss. In addition, it would be particularly
interesting to develop even broader natural notions of good similarity functions, that allow for functions
that are not positive-semidefinite and yet provide even better kernel-to-similarity translations (e.g., not
squaring the margin parameter).

Subsequent Work: Inspired by our work in [24], Wang et al. [208] have recently analyzed different,
alternative sufficient conditions for learning via pairwise functions. In particular, Wang et al. [208]
analyze unbounded dissimilarity functions which are invariant to order-preserving transformations. They
provide conditions that they prove are sufficient for learning, though they may not include all good kernel
functions.

On a different line of inquiry, we have used this approach [40] for analyzing similarity functions in
the context of clustering (i.e., learning from purely unlabeled data). Specifically, in [40] we ask what
(stronger) properties would be sufficient to allow one to produce an accurate hypothesis without any
label information at all. We show that if one relaxes the objective (for example, allows the algorithm
to produce a hierarchical clustering such that some pruning is close to the correct answer), then one can
define a number of interesting graph-theoretic and game-theoretic properties of similarity functions that
are sufficient to cluster well. We present this in detail in Chapter 4.




Chapter  4 


A  Discriminative  Framework  for 
Clustering  via  Similarity  Functions 


Problems  of  clustering  data  from  pairwise  similarity  information  are  ubiquitous  in  Machine  Learning  and 
Computer  Science.  Theoretical  treatments  often  view  the  similarity  information  as  ground-truth  and  then 
design  algorithms  to  (approximately)  optimize  various  graph-based  objective  functions.  However,  in  most 
applications,  this  similarity  information  is  merely  based  on  some  heuristic;  the  ground  truth  is  really  the 
unknown  correct  clustering  of  the  data  points  and  the  real  goal  is  to  achieve  low  error  on  the  data.  In  this 
work,  we  develop  a  theoretical  approach  to  clustering  from  this  perspective.  In  particular,  motivated  by 
our  work  in  Chapter  3  that  asks  “what  natural  properties  of  a  similarity  (or  kernel)  function  are  sufficient 
to  be  able  to  learn  well?”  we  ask  “what  natural  properties  of  a  similarity  function  are  sufficient  to  be  able 
to  cluster  well?” 

To study this question we develop a theoretical framework that can be viewed as an analog for
clustering of the discriminative models for supervised classification (i.e., the Statistical Learning Theory
framework and the PAC learning model), where the object of study, rather than being a concept class, is
a class of (concept, similarity function) pairs, or equivalently, a property the similarity function should
satisfy with respect to the ground truth clustering. Our notion of property is similar to the large margin
property for a kernel or the properties given in Definitions 3.3.1, 3.3.5, 3.3.6, 3.4.1 or 3.4.2 for supervised
learning, though we will need to consider stronger conditions since we have no labeled data.

We then analyze both algorithmic and information theoretic issues in our model. While quite strong
properties are needed if the goal is to produce a single approximately-correct clustering, we find that a
number of reasonable properties are sufficient under two natural relaxations: (a) list clustering:
analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user
can select from) and (b) hierarchical clustering: the algorithm’s goal is to produce a hierarchy such that
the desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of
the clustering complexity of a given property (analogous to the notion of $\epsilon$-cover examined in Chapter 2),
that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for several
natural game-theoretic and learning-theoretic properties, as well as design new efficient algorithms that
are able to take advantage of them. Our algorithms for hierarchical clustering combine recent
learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to
the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here
uses regularity-type results of [113] and [14].




4.1  Introduction 


Clustering is an important problem in the analysis and exploration of data. It has a wide range of
applications in data mining, computer vision and graphics, and gene analysis. It has many variants and
formulations  and  it  has  been  extensively  studied  in  many  different  communities. 

In  the  Algorithms  literature,  clustering  is  typically  studied  by  posing  some  objective  function,  such 
as $k$-median, min-sum or $k$-means, and then developing algorithms for approximately optimizing this
objective given a data set represented as a weighted graph [83, 138, 146]. That is, the graph is viewed as
“ground  truth”  and  then  the  goal  is  to  design  algorithms  to  optimize  various  objectives  over  this  graph. 
However,  for  most  clustering  problems  such  as  clustering  documents  by  topic  or  clustering  web-search 
results  by  category,  ground  truth  is  really  the  unknown  true  topic  or  true  category  of  each  object.  The 
construction  of  the  weighted  graph  is  just  done  using  some  heuristic:  e.g.,  cosine-similarity  for  clustering 
documents  or  a  Smith-Waterman  score  in  computational  biology.  In  all  these  settings,  the  goal  is  really 
to  produce  a  clustering  that  is  as  accurate  as  possible  on  the  data.  Alternatively,  methods  developed  both 
in the algorithms and in the machine learning literature for learning mixtures of distributions [8, 19, 91,
95, 147, 205] explicitly have a notion of ground-truth clusters which they aim to recover. However, such
methods are based on very strong assumptions: they require an embedding of the objects into $R^n$ such that
the  clusters  can  be  viewed  as  distributions  with  very  specific  properties  (e.g.,  Gaussian  or  log-concave).  In 
many  real-world  situations  (e.g.,  clustering  web-search  results  by  topic,  where  different  users  might  have 
different  notions  of  what  a  “topic”  is)  we  can  only  expect  a  domain  expert  to  provide  a  notion  of  similarity 
between  objects  that  is  related  in  some  reasonable  ways  to  the  desired  clustering  goal,  and  not  necessarily 
an  embedding  with  such  strong  properties. 

In  this  work,  we  develop  a  theoretical  study  of  the  clustering  problem  from  this  perspective.  In 
particular, motivated by our work on similarity functions presented in Chapter 3 that asks “what
natural properties of a given kernel (or similarity) function K are sufficient to allow one to learn well?”
[24, 31, 133, 187, 190] we ask the question “what natural properties of a pairwise similarity function are
sufficient to allow one to cluster well?” To study this question we develop a theoretical framework which
can be thought of as a discriminative (PAC-style) model for clustering, though the basic object of study,
rather than a concept class, is a property of the similarity function K in relation to the target concept, much
like the types of properties stated in Chapter 3.

The main difficulty that appears when phrasing the problem in this general way is that if one defines
success as outputting a single clustering that closely approximates the correct clustering, then one needs
to assume very strong conditions on the similarity function. For example, if the function provided by our
expert is extremely good, say $K(x, y) > 1/2$ for all pairs $x$ and $y$ that should be in the same cluster, and
$K(x, y) < 1/2$ for all pairs $x$ and $y$ that should be in different clusters, then we could just use it to recover
the clusters in a trivial way.¹ However, if we just slightly weaken this condition to simply require that all
points $x$ are more similar to all points $y$ from their own cluster than to any points $y$ from any other clusters,
then this is no longer sufficient to uniquely identify even a good approximation to the correct answer. For
instance, in the example in Figure 4.1, there are multiple clusterings consistent with this property. Even
if  one  is  told  the  correct  clustering  has  3  clusters,  there  is  no  way  for  an  algorithm  to  tell  which  of  the 
two  (very  different)  possible  solutions  is  correct.  In  fact,  results  of  Kleinberg  [151]  can  be  viewed  as 
effectively  ruling  out  a  broad  class  of  scale-invariant  properties  such  as  this  one  as  being  sufficient  for 
producing  the  correct  answer. 
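
The trivial recovery mentioned above for the very strong condition ($K(x, y) > 1/2$ within clusters and $K(x, y) < 1/2$ across clusters) can be sketched as taking connected components of the graph that links pairs with similarity above $1/2$. The function name, the toy data, and the toy similarity function below are assumptions introduced for illustration only.

```python
from typing import Callable, List, Sequence, Set


def clusters_by_threshold(
    points: Sequence[str],
    K: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Set[str]]:
    """Connected components of the graph with an edge wherever K(x, y) > threshold."""
    unassigned = list(points)
    clusters: List[Set[str]] = []
    while unassigned:
        seed = unassigned.pop(0)
        component = {seed}
        frontier = [seed]
        while frontier:
            x = frontier.pop()
            linked = [y for y in unassigned if K(x, y) > threshold]
            for y in linked:
                unassigned.remove(y)
                component.add(y)
            frontier.extend(linked)
        clusters.append(component)
    return clusters


# Hypothetical ground truth, used only to build a toy similarity function.
TRUE_LABEL = {"a1": 1, "a2": 1, "b1": 2, "b2": 2, "b3": 2}


def toy_K(x: str, y: str) -> float:
    """0.9 inside a ground-truth cluster, 0.1 across clusters."""
    return 0.9 if TRUE_LABEL[x] == TRUE_LABEL[y] else 0.1


recovered = clusters_by_threshold(list(TRUE_LABEL), toy_K)
```

Under the strong condition every cluster is a clique in the thresholded graph and no edge crosses clusters, so the components coincide with the ground-truth clusters. Once the condition is weakened as described above, no single threshold is guaranteed to work.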

¹Correlation Clustering can be viewed as a relaxation that allows some pairs to fail to satisfy this condition, and
the algorithms of [11, 66, 84, 197] show this is sufficient to cluster well if the number of pairs that fail is small.
Planted partition models [13, 92, 169] allow for many failures so long as they occur at random. We will be interested
in much more drastic relaxations, however.




Figure 4.1: Data lies in four regions A, B, C, D (e.g., think of as documents on baseball, football, TCS,
and AI). Suppose that $K(x, y) = 1$ if $x$ and $y$ belong to the same region, $K(x, y) = 1/2$ if $x \in A$ and
$y \in B$ or if $x \in C$ and $y \in D$, and $K(x, y) = 0$ otherwise. Even assuming that all points are more similar
to other points in their own cluster than to any point in any other cluster, there are still multiple consistent
clusterings, including two consistent 3-clusterings ($(A \cup B, C, D)$ or $(A, B, C \cup D)$). However, there is
a single hierarchical decomposition such that any consistent clustering is a pruning of this tree.
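
The example of Figure 4.1 can be checked mechanically. The sketch below builds the similarity function from the caption on a small made-up point set (two points per region, an assumption for illustration) and tests whether every point is strictly more similar to all points in its own cluster than to any point in another cluster. The helper names are hypothetical.

```python
from typing import Dict, List, Set

# Two illustrative points per region; the point names are made up.
REGION: Dict[str, str] = {
    "a1": "A", "a2": "A", "b1": "B", "b2": "B",
    "c1": "C", "c2": "C", "d1": "D", "d2": "D",
}
HALF_PAIRS = {frozenset("AB"), frozenset("CD")}


def K(x: str, y: str) -> float:
    """Similarity function from the caption of Figure 4.1."""
    rx, ry = REGION[x], REGION[y]
    if rx == ry:
        return 1.0
    if frozenset((rx, ry)) in HALF_PAIRS:
        return 0.5
    return 0.0


def is_consistent(clustering: List[Set[str]]) -> bool:
    """True if each point is more similar to every point of its own cluster
    than to any point of any other cluster. Clusters are sets of region names."""
    cluster_of = {r: i for i, regions in enumerate(clustering) for r in regions}
    points = list(REGION)
    for x in points:
        cx = cluster_of[REGION[x]]
        inside = [K(x, y) for y in points
                  if y != x and cluster_of[REGION[y]] == cx]
        outside = [K(x, y) for y in points if cluster_of[REGION[y]] != cx]
        if inside and outside and min(inside) <= max(outside):
            return False
    return True


first = is_consistent([{"A", "B"}, {"C"}, {"D"}])
second = is_consistent([{"A"}, {"B"}, {"C", "D"}])
mixed = is_consistent([{"A", "C"}, {"B"}, {"D"}])
```

Both 3-clusterings named in the caption pass the check, while a clustering that merges A with C does not, which is the ambiguity the caption describes.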


In  our  work  we  overcome  this  problem  by  considering  two  relaxations  of  the  clustering  objective  that 
are  natural  for  many  clustering  applications.  The  first  is  as  in  list-decoding  to  allow  the  algorithm  to 
produce  a  small  list  of  clusterings  such  that  at  least  one  of  them  has  low  error.  The  second  is  instead  to 
allow  the  clustering  algorithm  to  produce  a  tree  (a  hierarchical  clustering)  such  that  the  correct  answer  is 
approximately  some  pruning  of  this  tree.  For  instance,  the  example  in  Figure  4.1  has  a  natural  hierarchical 
decomposition  of  this  form.  Both  relaxed  objectives  make  sense  for  settings  in  which  we  imagine  the 
output  being  fed  to  a  user  who  will  then  decide  what  she  likes  best.  For  example,  with  the  tree  relaxation, 
we allow the clustering algorithm to effectively say: “I wasn’t sure how specific you wanted to be, so
if any of these clusters are too broad, just click and I will split it for you.” We then show that with
these relaxations, a number of interesting, natural learning-theoretic and game-theoretic properties can be
defined that each are sufficient to allow an algorithm to cluster well.

At the high level, our framework has two goals. The first is to provide advice about what type of
algorithms to use given certain beliefs about the relation of the similarity function to the clustering task. That
is, if a domain expert handed us a similarity function that they believed satisfied a certain natural property
with respect to the true clustering, what algorithm would be most appropriate to use? The second goal
is providing advice to the designer of a similarity function for a given clustering task (such as clustering
web-pages by topic). That is, if a domain expert is trying to come up with a similarity measure, what
properties should they aim for?

4.1.1  Perspective 

The standard approach in theoretical computer science to clustering is to choose some objective function
(e.g., $k$-median) and then to develop algorithms that approximately optimize that objective [83, 100, 138,
146]. If the true goal is to achieve low error with respect to an underlying correct clustering (e.g., a user’s
desired clustering of search results by topic), however, then one can view this as implicitly making the
strong assumption that not only does the correct clustering have a good objective value, but also that all
clusterings that approximately optimize the objective must be close to the correct clustering as well. In
this work, we instead explicitly consider the goal of producing a clustering of low error and then ask what
natural properties of the similarity function in relation to the target clustering are sufficient to allow an




algorithm  to  do  well. 

In this respect we are closer to work done in the area of clustering or learning with mixture models [8,
19, 95, 147, 205]. That work, like ours, has an explicit notion of a correct ground-truth clustering of
the data points and to some extent can be viewed as addressing the question of what properties of an
embedding of data into $R^n$ would be sufficient for an algorithm to cluster well. However, unlike our
focus, the types of assumptions made are distributional and in that sense are much more stringent than the
types of properties we will be considering. This is similarly the case with work on planted partitions in
graphs [13, 92, 169]. Abstractly speaking, this view of clustering parallels the generative classification
setting [103], while the framework we propose parallels the discriminative classification setting (i.e., the
PAC model of Valiant [201] and the Statistical Learning Theory framework of Vapnik [203] and the setting
used in Chapters 2, 3, 5 and 6 of this thesis).

In  the  PAC  model  for  learning  [201],  the  basic  object  of  study  is  the  concept  class,  and  one  asks 
what  natural  classes  are  efficiently  learnable  and  by  what  algorithms.  In  our  setting,  the  basic  object  of 
study is a property, which can be viewed as a set of (concept, similarity function) pairs, i.e., the pairs for
which the target concept and similarity function satisfy the desired relation. As with the PAC model for
learning, we then ask what natural properties are sufficient to efficiently cluster well (in either the tree or
list models) and by what algorithms. Note that an alternative approach in clustering is to pick some specific
algorithm (e.g., $k$-means, EM) and analyze conditions for that algorithm to “succeed”. While there is also
work in classification of that type (e.g., when does some heuristic like ID3 work well), another important
aspect is in understanding which classes of functions are learnable and by what algorithms. We study
the analogous questions in the clustering context: what properties are sufficient for clustering, and then
ideally the simplest algorithm to cluster given that property.

4.1.2  Our  Results 

We provide a PAC-style framework for analyzing what properties of a similarity function are sufficient to
allow one to cluster well under the above two relaxations (list and tree) of the clustering objective. We
analyze both algorithmic and information theoretic questions in our model and provide results for several
natural game-theoretic and learning-theoretic properties. Specifically:

•  We consider a family of stability-based properties, showing that a natural generalization of the
“stable marriage” property is sufficient to produce a hierarchical clustering. (The property is that
no two subsets $A \subset C$, $A' \subset C'$ of clusters $C \ne C'$ in the correct clustering are both more similar
on average to each other than to the rest of their own clusters.) Moreover, a significantly weaker
notion of stability is also sufficient to produce a hierarchical clustering, but requires a more involved
algorithm.

•  We show that a weaker “average-attraction” property (which is provably not enough to produce a
single correct hierarchical clustering) is sufficient to produce a small list of clusterings, and give
generalizations to even weaker conditions that generalize the notion of large-margin kernel
functions.

•  We define the clustering complexity of a given property (the minimum possible list length that can
be guaranteed by any algorithm) and provide both upper and lower bounds for the properties we
consider. This notion is analogous to notions of capacity in classification [72, 103, 203] and it
provides a formal measure of the inherent usefulness of a given property.

•  We also show that properties implicitly assumed by approximation algorithms for standard
graph-based objective functions can be viewed as special cases of some of the properties considered above.

•  We show how our methods can be extended to the inductive case, i.e., by using just a constant-sized
sample, as in property testing. While most of our algorithms extend in a natural way, for certain
properties their analysis requires more involved arguments using regularity-type results of [14, 113].

More  generally,  our  framework  provides  a  formal  way  to  analyze  what  properties  of  a  similarity  function 
would  be  sufficient  to  produce  low-error  clusterings,  as  well  as  what  algorithms  are  suited  for  a  given 
property.  For  some  of  our  properties  we  are  able  to  show  that  known  algorithms  succeed  (e.g.  variations 
of  bottom-up  hierarchical  linkage  based  algorithms),  but  for  the  most  general  ones  we  need  new  algorithms 
that  are  able  to  take  advantage  of  them. 

4.1.3  Connections  to  other  chapters  and  to  other  related  work 

Some of the questions we address can be viewed as a generalization of questions studied in Chapter 3 or in
other work in machine learning that asks what properties of similarity functions (especially kernel functions)
are sufficient to allow one to learn well [24, 31, 133, 187, 190]. E.g., the usual statement is that if a kernel
function satisfies the property that the target function is separable by a large margin in the implicit kernel
space, then learning can be done from few labeled examples. The clustering problem is more difficult
because there is no labeled data, and even in the relaxations we consider, the forms of feedback allowed
are much weaker.

We note that as in learning, given an embedding of data into some metric space, the similarity function
$K(x, x')$ need not be a direct translation of distance like $e^{-d(x, x')}$, but rather may be a derived function
based on the entire dataset. For example, in the diffusion kernel of [156], the similarity $K(x, x')$ is related
to the effective resistance between $x$ and $x'$ in a weighted graph defined from distances in the original
metric. This would be a natural similarity function to use, for instance, if data lies in two well-separated
pancakes.

In the inductive setting, where we imagine our given data is only a small random sample of the entire
data set, our framework is close in spirit to recent work done on sample-based clustering (e.g., [49]) in
the context of clustering algorithms designed to optimize a certain objective. Based on such a sample,
these algorithms have to output a clustering of the full domain set, that is evaluated with respect to the
underlying distribution.

We also note that the assumption that the similarity function satisfies a given property with respect
to the target clustering is analogous to the assumption considered in Chapter 2 that the target satisfies a
certain relation with respect to the underlying distribution. That is, the similarity function plays the role
of the distribution in Chapter 2. At a technical level, however, the results are not directly comparable. In
particular, in Chapter 2 we focus on compatibility notions that can be estimated from a finite sample, and
the main angle there is understanding what is a good target for a given distribution given a compatibility
relation and what is a good distribution for a given compatibility notion. Here we imagine fixing
the target, and we are trying to understand what is a good similarity function for the given target.


4.2  Definitions  and  Preliminaries 

We consider a clustering problem $(S, \ell)$ specified as follows. Assume we have a data set $S$ of $n$ objects,
where each object is an element of an abstract instance space $X$. Each $x \in S$ has some (unknown)
“ground-truth” label $\ell(x)$ in $Y = \{1, \ldots, k\}$, where we will think of $k$ as much smaller than $n$. The goal
is to produce a hypothesis $h : X \to Y$ of low error up to isomorphism of label names. Formally, we define
the error of $h$ to be $err(h) = \min_{\sigma \in S_k} \left[ \Pr_{x \in S}[\sigma(h(x)) \ne \ell(x)] \right]$. We will assume that a target error rate
$\epsilon$, as well as $k$, are given as input to the algorithm.
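
The error measure defined above, the misclassification rate minimized over all renamings $\sigma \in S_k$ of the cluster labels, can be sketched directly. The function name and the toy label vectors are assumptions for illustration; the brute-force search over all $k!$ permutations is meant only to mirror the definition and is practical only for small $k$.

```python
from itertools import permutations
from typing import Sequence


def clustering_error(h: Sequence[int], truth: Sequence[int], k: int) -> float:
    """Fraction of points mislabeled by h under the best renaming of its labels.

    h and truth are equal-length sequences of labels in {1, ..., k}.
    """
    labels = range(1, k + 1)
    n = len(truth)
    best = n
    for perm in permutations(labels):
        sigma = dict(zip(labels, perm))  # candidate renaming of h's labels
        mistakes = sum(1 for hx, lx in zip(h, truth) if sigma[hx] != lx)
        best = min(best, mistakes)
    return best / n


truth = [1, 1, 1, 2, 2, 3]
same_up_to_names = clustering_error([2, 2, 2, 3, 3, 1], truth, k=3)
one_point_off = clustering_error([2, 2, 3, 3, 3, 1], truth, k=3)
```

The first hypothesis differs from the ground truth only in label names and so has error 0; the second places a single point in the wrong cluster and has error 1/6.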




We will be considering clustering algorithms whose only access to their data is via a pairwise similarity
function $K(x, x')$ that given two examples outputs a number in the range $[-1, 1]$.² We will say that $K$ is
a symmetric similarity function if $K(x, x') = K(x', x)$ for all $x, x'$.

Our focus is to analyze natural properties that are sufficient for a similarity function $K$ to be good for
a clustering problem $(S, \ell)$ which (ideally) are intuitive, broad, and imply that such a similarity function
results in the ability to cluster well. Formally, a property $\mathcal{P}$ is a relation $\{(\ell, K)\}$ and we say that $K$ has
property $\mathcal{P}$ with respect to $\ell$ if $(\ell, K) \in \mathcal{P}$.

As mentioned in the introduction, however, requiring an algorithm to output a single low-error
clustering rules out even quite strong properties. Instead we will consider two objectives that are natural if
one assumes the ability to get some limited additional feedback from a user. Specifically, we consider the
following two models:

1.  List model: In this model, the goal of the algorithm is to propose a small number of clusterings
such that at least one has error at most $\epsilon$. As in work on property testing, the list length should
depend on $\epsilon$ and $k$ only, and be independent of $n$. This list would then go to a domain expert or
some hypothesis-testing portion of the system which would then pick out the best clustering.

2.  Tree model: In this model, the goal of the algorithm is to produce a hierarchical clustering: that
is, a tree on subsets such that the root is the set $S$, and the children of any node $S'$ in the tree form
a partition of $S'$. The requirement is that there must exist a pruning $h$ of the tree (not necessarily
using nodes all at the same level) that has error at most $\epsilon$. In many applications (e.g., document
clustering) this is a significantly more user-friendly output than the list model. Note that any given
tree has at most $2^{2k}$ prunings of size $k$ [154], so this model is at least as strict as the list model.
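
To make the notion of a pruning concrete, the sketch below enumerates all prunings of a small tree, using the four regions of Figure 4.1 as leaves. The nested-tuple tree representation and the function names are assumptions made for this illustration and are not the thesis's data structures.

```python
from typing import FrozenSet, Iterator, List, Tuple, Union

# A tree is either a leaf (a string) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(node: Tree) -> FrozenSet[str]:
    """Set of leaves below a node."""
    if not isinstance(node, tuple):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def prunings(node: Tree) -> Iterator[List[FrozenSet[str]]]:
    """Yield every pruning of the subtree rooted at node.

    A pruning either keeps the node as one cluster, or replaces it by a
    pruning of each of its children.
    """
    yield [leaves(node)]
    if isinstance(node, tuple):
        partial: List[List[FrozenSet[str]]] = [[]]
        for child in node:
            partial = [p + q for p in partial for q in prunings(child)]
        yield from partial


# Hierarchy for the example of Figure 4.1: {A, B} and {C, D} under the root.
tree: Tree = (("A", "B"), ("C", "D"))
all_prunings = list(prunings(tree))
size_three = [p for p in all_prunings if len(p) == 3]
```

For this tree there are five prunings in total, and exactly two of them have three clusters, namely $(A \cup B, C, D)$ and $(A, B, C \cup D)$, well within the $2^{2k}$ bound for $k = 3$.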

Transductive vs Inductive. Clustering is typically posed as a “transductive” [203] problem in that we are
asked to cluster a given set of points $S$. We can also consider an inductive model in which $S$ is merely a
small random subset of points from a much larger abstract instance space $X$, and our goal is to produce a
hypothesis $h : X \to Y$ of low error on $X$. For a given property of our similarity function (with respect
to  X)  we  can  then  ask  how  large  a  set  S  we  need  to  see  in  order  for  our  list  or  tree  produced  with  respect 
to  S  to  induce  a  good  solution  with  respect  to  X.  For  clarity  of  exposition,  for  most  of  this  chapter  we 
will  focus  on  the  transductive  setting.  In  Section  4.6  we  show  how  our  algorithms  can  be  adapted  to  the 
inductive  setting. 

Realizable vs Agnostic. For most of the properties we consider here, our assumptions are analogous to
the realizable case in supervised learning and our goal is to get $\epsilon$-close to the target (in a tree or list) for
any desired $\epsilon > 0$. For other properties, our assumptions are more like the agnostic case in that we will assume
only that a $1 - \nu$ fraction of the data satisfies a certain condition. In these cases our goal is to get $(\nu + \epsilon)$-close
to the target.

Notation. We will denote the underlying ground-truth clusters as $C_1, \ldots, C_k$ (some of which may be
empty). For $x \in X$, we use $C(x)$ to denote the cluster to which point $x$ belongs. For $A \subseteq X$,
$B \subseteq X$, let $K(A, B) = \mathbf{E}_{x \in A, x' \in B}[K(x, x')]$. We call this the average attraction of $A$ to $B$. Let
$K_{max}(A, B) = \max_{x \in A, x' \in B} K(x, x')$; we call this the maximum attraction of $A$ to $B$. Given two clusterings
$g$ and $h$ we define the distance $d(g, h) = \min_{\sigma \in S_k} \left[ \Pr_{x \in S}[\sigma(h(x)) \ne g(x)] \right]$, i.e., the fraction of points in
the symmetric difference under the optimal renumbering of the clusters.
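
The two set-level quantities just defined, average attraction $K(A, B)$ and maximum attraction $K_{max}(A, B)$, can be sketched as below. The function names and the one-dimensional toy similarity are assumptions for illustration; the expectation is taken over the uniform distribution on pairs, which is one natural reading for finite sets.

```python
from typing import Callable, Sequence


def avg_attraction(K: Callable[[float, float], float],
                   A: Sequence[float], B: Sequence[float]) -> float:
    """K(A, B): average of K(x, x') over all pairs x in A, x' in B."""
    total = sum(K(x, y) for x in A for y in B)
    return total / (len(A) * len(B))


def max_attraction(K: Callable[[float, float], float],
                   A: Sequence[float], B: Sequence[float]) -> float:
    """K_max(A, B): maximum of K(x, x') over all pairs x in A, x' in B."""
    return max(K(x, y) for x in A for y in B)


def toy_K(x: float, y: float) -> float:
    """Hypothetical similarity on the unit interval: 1 minus the distance."""
    return 1.0 - abs(x - y)


A = [0.0, 0.2]
B = [0.5, 1.0]
average = avg_attraction(toy_K, A, B)
maximum = max_attraction(toy_K, A, B)
```

Here the four pairwise similarities are 0.5, 0.0, 0.7 and 0.2, so the average attraction is 0.35 and the maximum attraction is 0.7 (up to floating-point rounding).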

We are interested in natural properties that we might ask a similarity function to satisfy with respect
to the ground truth clustering. For example, one (strong) property would be that all points $x$ are more
similar to all points $x' \in C(x)$ than to any $x' \notin C(x)$; we call this the strict separation property. A

^That is, the input to the clustering algorithm is just a weighted graph. However, we still want to conceptually
view $K$ as a function over abstract objects.




weaker property would be to just require that points $x$ are on average more similar to their own cluster
than to any other cluster, that is, $K(x, C(x) - \{x\}) > K(x, C_i)$ for all $C_i \neq C(x)$. We will also consider
intermediate “stability” conditions. For properties such as these we will be interested in the size of the
smallest list any algorithm could hope to output that would guarantee that at least one clustering in the list
has error at most $\epsilon$. Specifically, we define the clustering complexity of a property as:

Definition 4.2.1 Given a property $\mathcal{P}$ and similarity function $K$, define the $(\epsilon, k)$-clustering complexity
of the pair $(\mathcal{P}, K)$ to be the length of the shortest list of clusterings $h_1, \ldots, h_t$ such that any consistent
$k$-clustering is $\epsilon$-close to some clustering in the list.^ That is, at least one $h_i$ must have error at most $\epsilon$.
The $(\epsilon, k)$-clustering complexity of $\mathcal{P}$ is the maximum of this quantity over all similarity functions $K$.

The clustering complexity notion is analogous to notions of capacity in classification [72, 103, 203]
and it provides a formal measure of the inherent usefulness of a given property.

Computational Complexity. In the transductive case, our goal will be to produce a list or a tree in time
polynomial in $n$ and ideally polynomial in $\epsilon$ and $k$ as well. We will indicate when our running times
involve a non-polynomial dependence on these parameters. In the inductive case, we want the running
time to depend only on $k$ and $\epsilon$ and to be independent of the size of the overall instance space $X$, under
the assumption that we have an oracle that in constant time can sample a random point from $X$.

In  the  following  sections  we  analyze  both  the  clustering  complexity  and  the  computational  complexity 
of  several  natural  properties  and  provide  efficient  algorithms  to  take  advantage  of  such  functions.  We 
start  by  analyzing  the  strict  separation  property  as  well  as  a  natural  relaxation  in  Section  4.3.  We  also 
give formal relationships between these properties and those considered implicitly by approximation
algorithms for standard clustering objectives. We then analyze a much weaker average-attraction property
in Section 4.4 that is similar to Definition 3.3.1 in Chapter 3 (and which, as we have seen, has close
connections to large margin properties studied in Learning Theory [24, 31, 133, 187, 190]). This property is
not sufficient to produce a hierarchical clustering, however, so we then turn to the question of how weak
a property can be and still be sufficient for hierarchical clustering, which leads us to analyze properties
motivated by game-theoretic notions of stability in Section 4.5.

Our framework allows one to study computational hardness results as well. While our focus is on
getting positive algorithmic results, we discuss a few simple hardness examples in Section 4.8.1.

4.3  Simple  Properties 

We  begin  with  the  simple  strict  separation  property  mentioned  above. 

Property 1 The similarity function $K$ satisfies the strict separation property for the clustering problem
$(S, \ell)$ if all $x$ are strictly more similar to any point $x' \in C(x)$ than to every $x' \notin C(x)$.

Given a similarity function satisfying the strict separation property, we can efficiently construct a tree
such that the ground-truth clustering is a pruning of this tree (Theorem 4.3.2). As mentioned above, a
consequence of this fact is a $2^{O(k)}$ upper bound on the clustering complexity of this property. We begin
by showing a matching $2^{\Omega(k)}$ lower bound.

Theorem 4.3.1 For $\epsilon < \frac{1}{2k}$, the strict separation property has $(\epsilon, k)$-clustering complexity at least $2^{k/2}$.

Proof: The similarity function is a generalization of the similarity in the picture in Figure 4.1. Specifically,
partition the $n$ points into $k$ subsets $\{R_1, \ldots, R_k\}$ of $n/k$ points each. Group the subsets into pairs
$\{(R_1, R_2), (R_3, R_4), \ldots\}$, and let $K(x,x') = 1$ if $x$ and $x'$ belong to the same $R_i$, $K(x,x') = 1/2$ if
$x$ and $x'$ belong to two subsets in the same pair, and $K(x,x') = 0$ otherwise. Notice that in this setting

^A clustering $C$ is consistent if $K$ has property $\mathcal{P}$ with respect to $C$.




there are $2^{k/2}$ clusterings (corresponding to whether or not to split each pair $R_i \cup R_{i+1}$) that are consistent
with Property 1 and differ from each other on at least $n/k$ points. Since $\epsilon < \frac{1}{2k}$, any given hypothesis
clustering can be $\epsilon$-close to at most one of these and so the clustering complexity is at least $2^{k/2}$. ■

We  now  present  the  upper  bound. 

Theorem  4.3.2  Let  K  be  a  similarity  function  satisfying  the  strict  separation  property.  Then  we  can 
efficiently  construct  a  tree  such  that  the  ground-truth  clustering  is  a  pruning  of  this  tree. 

Proof: If $K$ is symmetric, then to produce a tree we can simply use bottom-up “single linkage” (i.e.,
Kruskal’s algorithm). That is, we begin with $n$ clusters of size 1 and at each step we merge the two clusters
$C$, $C'$ maximizing $K_{\max}(C, C')$. This maintains the invariant that at each step the current clustering is
laminar with respect to the ground-truth: if the algorithm merges two clusters $C$ and $C'$, and $C$ is strictly
contained in some cluster $C_r$ of the ground truth, then by the strict separation property we must have
$C' \subset C_r$ as well. If $K$ is not symmetric, then single linkage may fail.^ However, in this case, the
following “Borůvka-inspired” algorithm can be used. Starting with $n$ clusters of size 1, draw a directed
edge from each cluster $C$ to the cluster $C'$ maximizing $K_{\max}(C, C')$. Then pick some cycle produced
(there must be at least one cycle) and collapse it into a single cluster, and repeat. Note that if a cluster
$C$ in the cycle is strictly contained in some ground-truth cluster $C_r$, then by the strict separation property
its out-neighbor must be as well, and so on around the cycle. So this collapsing maintains laminarity as
desired. ■
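Both procedures from this proof can be sketched in a few lines. This is an illustrative implementation, not the thesis's own code: the similarity function is assumed to be a nested list `K[x][y]`, trees are represented simply as the list of all clusters ever formed, ties are broken by iteration order, and for asymmetric input the single-linkage score takes the larger of the two directions (an assumption made here so that the function is well defined).

```python
def single_linkage(K, n):
    """Bottom-up single linkage: repeatedly merge the pair with the largest K_max."""
    clusters = [frozenset([i]) for i in range(n)]
    nodes = list(clusters)  # every cluster ever formed, i.e. the nodes of the tree
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: max(
            max(K[x][y], K[y][x]) for x in clusters[p[0]] for y in clusters[p[1]]))
        merged = clusters[a] | clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        nodes.append(merged)
    return nodes

def cycle_collapse_tree(K, n):
    """Boruvka-inspired variant for asymmetric K: each cluster points to the
    cluster it is most attracted to; collapse one directed cycle per round."""
    clusters = [frozenset([i]) for i in range(n)]
    nodes = list(clusters)
    while len(clusters) > 1:
        out = []
        for a, A in enumerate(clusters):
            others = [b for b in range(len(clusters)) if b != a]
            out.append(max(others, key=lambda b: max(
                K[x][y] for x in A for y in clusters[b])))
        # Follow out-edges from cluster 0 until some cluster repeats: that closes a cycle.
        seen, cur = [], 0
        while cur not in seen:
            seen.append(cur)
            cur = out[cur]
        cycle = seen[seen.index(cur):]
        merged = frozenset().union(*(clusters[i] for i in cycle))
        clusters = [c for i, c in enumerate(clusters) if i not in cycle] + [merged]
        nodes.append(merged)
    return nodes
```

On a three-point asymmetric instance with $K(x,y) = 1$, $K(y,z) = K(z,y) = 1/2$ and all other cross similarities 0, `single_linkage` merges $x$ with $y$ first, while `cycle_collapse_tree` collapses the cycle between $y$ and $z$ and so keeps $\{y, z\}$ as a node.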

Note: Even though the strict separation property is quite strong, a similarity function satisfying this
property can still fool a top-down spectral clustering approach. See Figure 4.2 in Section 4.8.4.

We  can  also  consider  the  agnostic  version  of  the  strict  separation  property,  where  we  require  that  K 
satisfies  strict  separation  for  most  of  the  data. 

Property 2 The similarity function $K$ satisfies $\nu$-strict separation for the clustering problem $(S, \ell)$ if for
some $S' \subseteq S$ of size $(1 - \nu)n$, $K$ satisfies strict separation for $(S', \ell)$.

We  can  then  show  that: 

Theorem 4.3.3 If $K$ satisfies $\nu$-strict separation, then so long as the smallest correct cluster has size
greater than $5\nu n$, we can produce a tree such that the ground-truth clustering is $\nu$-close to a pruning of
this tree.

For a proof see Section 4.7, where we also show that properties implicitly assumed by approximation
algorithms for standard graph-based objective functions can be viewed as special cases of the $\nu$-strict
separation property.


4.4  Weaker  properties 

A much weaker property to ask of a similarity function is just that most points are noticeably more similar
on average to points in their own cluster than to points in any other cluster. This is similar to
Definition 3.3.1 in Chapter 3 (and which, as we have seen, has close connections to large margin properties
studied in Learning Theory [24, 31, 133, 187, 190]).

Specifically,  we  define: 

^Consider 3 points $x, y, z$ whose correct clustering is $(\{x\}, \{y, z\})$. If $K(x,y) = 1$, $K(y,z) = K(z,y) = 1/2$,
and $K(y,x) = K(z,x) = 0$, then this is consistent with strict separation and yet the algorithm will incorrectly
merge $x$ and $y$ in its first step.




Property 3 A similarity function $K$ satisfies the $(\nu, \gamma)$-average attraction property for the clustering
problem $(S, \ell)$ if a $1 - \nu$ fraction of examples $x$ satisfy:

$K(x, C(x)) \ge K(x, C_i) + \gamma$ for all $i \in Y$, $i \neq \ell(x)$.

This is a fairly natural property to ask of a similarity function: if a point $x$ is more similar on
average to points in a different cluster than to those in its own, it is hard to expect an algorithm to label
it correctly. The following is a simple clustering algorithm that given a similarity function $K$ satisfying
the average attraction property produces a list of clusterings of size that depends only on $\epsilon$, $k$, and $\gamma$.
Specifically,

Algorithm 2 Sampling Based Algorithm, List Model

Input: Data set $S$, similarity function $K$, parameters $\gamma, \epsilon > 0$, $k \in Z^+$; $N(\epsilon, \gamma, k)$, $s(\epsilon, \gamma, k)$.

• Set $\mathcal{L} = \emptyset$.

• Repeat $N(\epsilon, \gamma, k)$ times

For $k' = 1, \ldots, k$ do:

- Pick a set $R_S^{k'}$ of $s(\epsilon, \gamma, k)$ random points from $S$.

- Let $h$ be the average-nearest neighbor hypothesis induced by the sets $R_S^i$, $1 \le i \le k'$. That is,
for any point $x \in S$, define $h(x) = \arg\max_{i \in \{1, \ldots, k'\}} [K(x, R_S^i)]$. Add $h$ to $\mathcal{L}$.

• Output the list $\mathcal{L}$.
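The loop structure of Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity function is a nested list `K[x][y]`, sample points are drawn independently with replacement, ties in the arg max go to the smallest index, and the parameter names `num_rounds` and `sample_size` stand in for $N(\epsilon, \gamma, k)$ and $s(\epsilon, \gamma, k)$, whose values the caller must supply.

```python
import random

def average_nearest_neighbor(K, S, sample_sets):
    """Label each x in S with the index of the sample set it is most attracted to on average."""
    h = {}
    for x in S:
        scores = [sum(K[x][y] for y in R) / len(R) for R in sample_sets]
        h[x] = max(range(len(sample_sets)), key=scores.__getitem__)
    return h

def sampling_list_algorithm(K, S, k, num_rounds, sample_size, rng):
    """Return a list of num_rounds * k hypotheses (clusterings) of S."""
    hypotheses = []
    for _ in range(num_rounds):
        sample_sets = []
        for _k_prime in range(1, k + 1):
            # Draw the next sample set, then cluster using the sets drawn so far.
            sample_sets.append([rng.choice(S) for _ in range(sample_size)])
            hypotheses.append(average_nearest_neighbor(K, S, sample_sets))
    return hypotheses
```

A call such as `sampling_list_algorithm(K, list(range(n)), k, num_rounds, sample_size, random.Random(0))` returns the whole list; the guarantee discussed in the text is only that some entry of the list is close to the target, so a caller still has to choose among them.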


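Theorem 4.4.1 Let $K$ be a similarity function satisfying the $(\nu, \gamma)$-average attraction property for the
clustering problem $(S, \ell)$. Using Algorithm 2 with the parameters $s(\epsilon, \gamma, k) = \frac{4}{\gamma^2} \ln\left(\frac{8k}{\epsilon\delta}\right)$ and
$N(\epsilon, \gamma, k) = \left(\frac{2k}{\epsilon}\right)^{\frac{4k}{\gamma^2} \ln\left(\frac{8k}{\epsilon\delta}\right)} \ln\left(\frac{1}{\delta}\right)$ we can produce a list of at most
$k^{O\left(\frac{k}{\gamma^2} \ln\left(\frac{1}{\epsilon}\right) \ln\left(\frac{k}{\epsilon\delta}\right)\right)}$ clusterings such that with probability
$1 - \delta$ at least one of them is $(\nu + \epsilon)$-close to the ground-truth.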


Proof: We say that a ground-truth cluster is big if it has probability mass at least $\frac{\epsilon}{2k}$; otherwise, we say
that the cluster is small. Let $k'$ be the number of “big” ground-truth clusters. Clearly the probability mass
in all the small clusters is at most $\epsilon/2$.

Let us arbitrarily number the big clusters $C_1, \ldots, C_{k'}$. Notice that in each round there is at least a
$\left(\frac{\epsilon}{2k}\right)^{s(\epsilon, \gamma, k)}$ probability that $R_S^i \subseteq C_i$, and so at least a $\left(\frac{\epsilon}{2k}\right)^{k s(\epsilon, \gamma, k)}$ probability that
$R_S^i \subseteq C_i$ for all $i \le k'$. Thus the number of rounds $\left(\frac{2k}{\epsilon}\right)^{\frac{4k}{\gamma^2} \ln\left(\frac{8k}{\epsilon\delta}\right)} \ln\left(\frac{1}{\delta}\right)$ is large enough so that
with probability at least $1 - \delta/2$, in at least one of the $N(\epsilon, \gamma, k)$ rounds we have $R_S^i \subseteq C_i$ for all
$i \le k'$. Let us fix now one such good round. We argue next that the clustering induced by the sets picked
in this round has error at most $\nu + \epsilon$ with probability at least $1 - \delta/2$.

Let Good be the set of $x$ in the big clusters satisfying

$K(x, C(x)) \ge K(x, C_j) + \gamma$ for all $j \in Y$, $j \neq \ell(x)$.


By assumption and from the previous observations, $\Pr_{x \sim S}[x \in \mathrm{Good}] \ge 1 - \nu - \epsilon/2$. Now, fix $x \in \mathrm{Good}$.
Since $K(x, x') \in [-1, 1]$, by Hoeffding bounds we have that over the random draw of $R_S^j$, conditioned
on $R_S^j \subseteq C_j$,

$$\Pr_{R_S^j}\left( \left| \mathbf{E}_{x' \sim R_S^j}[K(x, x')] - K(x, C_j) \right| \ge \gamma/2 \right) \le 2 e^{-2 |R_S^j| \gamma^2 / 4},$$

for all $j \in \{1, \ldots, k'\}$. By our choice of $R_S^j$, each of these probabilities is at most $\epsilon\delta/4k$. So, for any
given $x \in \mathrm{Good}$, there is at most a $\epsilon\delta/4$ probability of error over the draw of the sets $R_S^j$. Since this is




true for any $x \in \mathrm{Good}$, it implies that the expected error of this procedure, over $x \in \mathrm{Good}$, is at most
$\epsilon\delta/4$, which by Markov’s inequality implies that there is at most a $\delta/2$ probability that the error rate over
Good is more than $\epsilon/2$. Adding in the $\nu + \epsilon/2$ probability mass of points not in Good yields the theorem.

■ 

Note that Theorem 4.4.1 immediately implies a corresponding upper bound on the $(\epsilon, k)$-clustering
complexity of the $(\epsilon/2, \gamma)$-average attraction property. Note, however, that this bound is not polynomial in
$k$ and $\gamma$. We can also give a lower bound showing that the exponential dependence on $\gamma$ is necessary, and
furthermore this property is not sufficient to cluster in the tree model:

Theorem 4.4.2 For $\epsilon < \gamma/2$, the $(\epsilon, k)$-clustering complexity of the $(0, \gamma)$-average attraction property is
at least $\max_{k' \le k} k'^{1/\gamma} / k'!$, and moreover this property is not sufficient to cluster in the tree model.

Proof: Consider $\frac{1}{\gamma}$ regions $\{R_1, \ldots, R_{1/\gamma}\}$ each with $\gamma n$ points. Assume $K(x, x') = 1$ if $x$ and $x'$ belong
to the same region $R_i$ and $K(x, x') = 0$, otherwise. Notice that in this setting all the $k$-way partitions
of the set $\{R_1, \ldots, R_{1/\gamma}\}$ are consistent with Property 3 and they are all pairwise at distance at least $\gamma$
from each other (they differ on at least $\gamma n$ points). Since $\epsilon < \gamma/2$, any given hypothesis clustering can
be $\epsilon$-close to at most one of these and so the clustering complexity is at least the sum of Stirling numbers
of the 2nd kind $\sum_{k'=1}^{k} S(1/\gamma, k')$, which is at least $\max_{k' \le k} k'^{1/\gamma} / k'!$. ■
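The counting step at the end of this proof can be checked numerically. The sketch below computes Stirling numbers of the second kind with the standard recurrence $S(m, j) = j\,S(m-1, j) + S(m-1, j-1)$ and compares the number of partitions of $m = 1/\gamma$ regions into at most $k$ groups with the bound $\max_{k' \le k} k'^{m} / k'!$. The function names are hypothetical.

```python
from math import factorial

def stirling2(m, j):
    """Number of ways to partition m labelled items into exactly j non-empty groups."""
    table = [[0] * (j + 1) for _ in range(m + 1)]
    table[0][0] = 1
    for a in range(1, m + 1):
        for b in range(1, min(a, j) + 1):
            table[a][b] = b * table[a - 1][b] + table[a - 1][b - 1]
    return table[m][j]

def partitions_into_at_most(m, k):
    """Sum of S(m, k') for k' = 1..k."""
    return sum(stirling2(m, j) for j in range(1, k + 1))

def lower_bound(m, k):
    """max over k' <= k of k'^m / k'!."""
    return max(j ** m / factorial(j) for j in range(1, k + 1))
```

For example, with $\gamma = 1/4$ and $k = 2$ the count and the bound coincide at 8, and with $\gamma = 1/5$ and $k = 3$ the count is 41 against a bound of 40.5.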

Note: In fact, the clustering complexity bound immediately implies one cannot cluster in the tree model
since for $k = 2$ the bound is greater than 1.

We can further extend the lower bound in Theorem 4.4.2 to show the following:

Theorem 4.4.3 For $\epsilon < 1/2$, the $(\epsilon, k)$-clustering complexity of the $(0, \gamma)$-average attraction property is
at least $k^{\frac{k}{8\gamma}}$.

One can even weaken the above property to ask only that there exists an (unknown) weighting function
over data points (thought of as a “reasonableness score”), such that most points are on average more similar
to the reasonable points of their own cluster than to the reasonable points of any other cluster. This is a
generalization of the notion of $K$ being a kernel function with the large margin property [24, 191, 195, 203]
as shown in Chapter 3.

Property 4 A similarity function $K$ satisfies the $(\nu, \gamma)$-average weighted attraction property for the
clustering problem $(S, \ell)$ if there exists a weight function $w : X \to [0, 1]$ such that a $1 - \nu$ fraction of
examples $x$ satisfy:

$\mathbf{E}_{x' \in C(x)}[w(x') K(x, x')] \ge \mathbf{E}_{x' \in C_r}[w(x') K(x, x')] + \gamma$ for all $r \in Y$, $r \neq \ell(x)$.

If $K$ is a similarity function satisfying the $(\nu, \gamma)$-average weighted attraction property for the
clustering problem $(S, \ell)$, then we can again cluster well in the list model, but via a more involved
clustering algorithm. Formally we can show that:

Theorem 4.4.4 If $K$ is a similarity function satisfying the $(\nu, \gamma)$-average weighted attraction property
for the clustering problem $(S, \ell)$, we can produce a list of at most $k^{\tilde{O}\left(\frac{k}{\epsilon\gamma^2}\right)}$ clusterings such that with
probability $1 - \delta$ at least one of them is $(\epsilon + \nu)$-close to the ground-truth.

We defer the proof of Theorem 4.4.4 to Section 4.10.

A too-weak property: One could imagine further relaxing the average attraction property to simply
require that for all $C_i$, $C_j$ in the ground truth we have $K(C_i, C_i) \ge K(C_i, C_j) + \gamma$; that is, the average




intra-cluster similarity is larger than the average inter-cluster similarity. However, even for $k = 2$ and
$\gamma = 1/4$, this is not sufficient to produce clustering complexity independent of (or even polynomial in) $n$.
In particular, suppose there are two regions $A$, $B$ of $n/2$ points each such that $K(x, x') = 1$ for $x, x'$ in
the same region and $K(x, x') = 0$ for $x, x'$ in different regions. However, suppose $C_1$ contains 75% of
$A$ and 25% of $B$ and $C_2$ contains 25% of $A$ and 75% of $B$. Then this property is satisfied for $\gamma = 1/4$
and yet by classic coding results (or Chernoff bounds), clustering complexity is clearly exponential in $n$
for $\epsilon < 1/8$. Moreover, this implies there is no hope in the inductive (or property testing) setting.
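The 75%/25% example can be verified on a small instance. The sketch below uses $n = 16$ (a hypothetical size chosen so the percentages give whole numbers), counts a point's similarity to itself inside $K(C_i, C_i)$ for simplicity, and also counts how many different ways $C_1$ could have been chosen, which is the source of the blow-up in clustering complexity.

```python
from math import comb

def avg_similarity(K, A, B):
    """Average of K[x][y] over x in A, y in B (self-pairs included when A and B overlap)."""
    return sum(K[x][y] for x in A for y in B) / (len(A) * len(B))

n = 16
region = [0] * 8 + [1] * 8          # region A = points 0..7, region B = points 8..15
K = [[1.0 if region[x] == region[y] else 0.0 for y in range(n)] for x in range(n)]

C1 = [0, 1, 2, 3, 4, 5, 8, 9]       # 75% of A and 25% of B
C2 = [6, 7, 10, 11, 12, 13, 14, 15] # 25% of A and 75% of B

intra = avg_similarity(K, C1, C1)   # 0.75^2 + 0.25^2
inter = avg_similarity(K, C1, C2)   # 2 * 0.75 * 0.25
num_choices_for_C1 = comb(8, 6) * comb(8, 2)
```

The gap `intra - inter` is exactly 1/4, so the relaxed condition holds with $\gamma = 1/4$, yet already at $n = 16$ there are 784 equally valid choices of $C_1$.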

4.5  Stability-based  Properties 

The properties in Section 4.4 are fairly general and allow construction of a list whose length depends only
on $\epsilon$ and $k$ (for constant $\gamma$), but are not sufficient to produce a single tree. In this section, we show that
several natural stability-based properties that lie between those considered in Sections 4.3 and 4.4 are in
fact sufficient for hierarchical clustering.

For  simplicity,  we  focus  on  symmetric  similarity  functions.  We  consider  the  following  relaxations  of 
Property  1  which  ask  that  the  ground  truth  be  “stable”  in  the  stable-marriage  sense: 

Property 5 A similarity function $K$ satisfies the strong stability property for the clustering problem $(S, \ell)$
if for all clusters $C_r$, $C_{r'}$, $r \neq r'$ in the ground-truth, for all $A \subset C_r$, $A' \subseteq C_{r'}$ we have

$K(A, C_r \setminus A) > K(A, A')$.

Property 6 A similarity function $K$ satisfies the weak stability property for the clustering problem $(S, \ell)$
if for all $C_r$, $C_{r'}$, $r \neq r'$, for all $A \subset C_r$, $A' \subseteq C_{r'}$, we have:

• If $A' \subset C_{r'}$ then either $K(A, C_r \setminus A) > K(A, A')$ or $K(A', C_{r'} \setminus A') > K(A', A)$.

• If $A' = C_{r'}$ then $K(A, C_r \setminus A) > K(A, A')$.
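On very small instances, strong stability (Property 5) can be checked by brute force over all pairs of subsets. The sketch below is illustrative only: it takes $A$ to range over non-empty proper subsets of $C_r$ and $A'$ over non-empty subsets of $C_{r'}$, stores the similarity function as a nested list, and runs in time exponential in the cluster sizes. All names are hypothetical.

```python
from itertools import combinations

def nonempty_subsets(items, proper=False):
    """Yield non-empty subsets of items; with proper=True, skip the full set."""
    items = list(items)
    top = len(items) - 1 if proper else len(items)
    for size in range(1, top + 1):
        for combo in combinations(items, size):
            yield set(combo)

def avg(K, A, B):
    """Average attraction K(A, B)."""
    return sum(K[x][y] for x in A for y in B) / (len(A) * len(B))

def satisfies_strong_stability(K, clusters):
    """Check K(A, C_r \\ A) > K(A, A') for all admissible A, A'."""
    for r, Cr in enumerate(clusters):
        for r2, Cr2 in enumerate(clusters):
            if r == r2:
                continue
            for A in nonempty_subsets(Cr, proper=True):
                rest = set(Cr) - A
                for A2 in nonempty_subsets(Cr2):
                    if avg(K, A, rest) <= avg(K, A, A2):
                        return False
    return True
```

A single high-similarity edge between two clusters is enough to make the check fail, because the two endpoints then prefer each other to the rest of their own clusters.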

We  can  interpret  weak  stability  as  saying  that  for  any  two  clusters  in  the  ground  truth,  there  does 
not  exist  a  subset  A  of  one  and  subset  A'  of  the  other  that  are  more  attracted  to  each  other  than  to  the 
remainder  of  their  true  clusters  (with  technical  conditions  at  the  boundary  cases)  much  as  in  the  classic 
notion  of  stable-marriage.  Strong  stability  asks  that  both  be  more  attracted  to  their  true  clusters.  To  further 
motivate these properties, note that if we take the example from Figure 4.1 and set a small random fraction
of the edges inside each dark-shaded region to 0, then with high probability this would still satisfy strong
stability with respect to all the natural clusters even though it no longer satisfies strict separation (or even
$\nu$-strict separation for any $\nu < 1$ if we included at least one edge incident to each vertex). Nonetheless,
we  can  show  that  these  stability  notions  are  sufficient  to  produce  a  hierarchical  clustering.  We  start  by 
proving  this  for  strong  stability  here  and  then  in  Theorem  4.5.2  we  also  prove  it  for  the  weak  stability. 

Algorithm  3  Average  Linkage,  Tree  Model 

Input:  Data  set  S,  similarity  function  K.  Output:  A  tree  on  subsets. 

•  Begin  with  n  singleton  clusters. 

• Repeat till only one cluster remains: Find clusters $C$, $C'$ in the current list which maximize
$K(C, C')$ and merge them into a single cluster.

•  Output  the  tree  with  single  elements  as  leaves  and  internal  nodes  corresponding  to  all  the  merges 
performed. 
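Algorithm 3 translates almost line for line into code. The following is a minimal sketch, assuming a symmetric similarity function stored as a nested list `K[x][y]`; the tree is returned as the list of every cluster formed (leaves first, root last), and ties are broken by iteration order, which the algorithm leaves unspecified.

```python
def average_linkage(K, n):
    """Bottom-up average linkage: repeatedly merge the pair C, C' maximizing K(C, C')."""
    def avg(A, B):
        return sum(K[x][y] for x in A for y in B) / (len(A) * len(B))

    clusters = [frozenset([i]) for i in range(n)]
    nodes = list(clusters)  # leaves of the tree
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] | clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        nodes.append(merged)  # internal node for this merge
    return nodes
```

The ground-truth clustering is a pruning of the resulting tree exactly when every ground-truth cluster appears among the returned nodes.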

Theorem  4.5.1  Let  K  be  a  symmetric  similarity  function  satisfying  Property  5.  Then  we  can  efficiently 
construct  a  binary  tree  such  that  the  ground-truth  clustering  is  a  pruning  of  this  tree. 




Proof: We will show that Algorithm 3 (Average Linkage) will produce the desired result. Note that the
algorithm uses $K(C, C')$ rather than $K_{\max}(C, C')$ as in single linkage; in fact in Figure 4.3 (in Section 4.8.4)
we show an example satisfying this property where single linkage would fail.

We prove correctness by induction. In particular, assume that our current clustering is laminar with
respect to the ground truth clustering (which is true at the start). That is, for each cluster $C$ in our current
clustering and each $C_r$ in the ground truth, we have either $C \subseteq C_r$, or $C_r \subseteq C$, or $C \cap C_r = \emptyset$. Now,
consider a merge of two clusters $C$ and $C'$. The only way that laminarity could fail to be satisfied after the
merge is if one of the two clusters, say, $C'$, is strictly contained inside some ground-truth cluster $C_r$ (so,
$C_r - C' \neq \emptyset$) and yet $C$ is disjoint from $C_r$. Now, note that by Property 5, $K(C', C_r - C') > K(C', x)$
for all $x \notin C_r$, and so in particular we have $K(C', C_r - C') > K(C', C)$. Furthermore, $K(C', C_r - C')$
is a weighted average of the $K(C', C'')$ over the sets $C'' \subseteq C_r - C'$ in our current clustering and so at
least one such $C''$ must satisfy $K(C', C'') > K(C', C)$. However, this contradicts the specification of the
algorithm, since by definition it merges the pair $C$, $C'$ such that $K(C', C)$ is greatest. ■

Theorem  4.5.2  Let  K  be  a  symmetric  similarity  function  satisfying  the  weak  stability  property.  Then  we 
can  efficiently  construct  a  binary  tree  such  that  the  ground-truth  clustering  is  a  pruning  of  this  tree. 

Proof: As in the proof of Theorem 4.5.1 we show that bottom-up average-linkage will produce the desired
result. Specifically, the algorithm is as follows: we begin with $n$ clusters of size 1, and then at each step
we merge the two clusters $C$, $C'$ such that $K(C, C')$ is highest.

We prove correctness by induction. In particular, assume that our current clustering is laminar with
respect to the ground truth clustering (which is true at the start). That is, for each cluster $C$ in our current
clustering and each $C_r$ in the ground truth, we have either $C \subseteq C_r$, or $C_r \subseteq C$, or $C \cap C_r = \emptyset$. Now,
consider a merge of two clusters $C$ and $C'$. The only way that laminarity could fail to be satisfied after the
merge is if one of the two clusters, say, $C'$, is strictly contained inside some ground-truth cluster $C_{r'}$ and
yet $C$ is disjoint from $C_{r'}$.

We distinguish a few cases. First, assume that $C$ is a cluster $C_r$ of the ground-truth. Then by definition,
$K(C', C_{r'} - C') > K(C', C)$. Furthermore, $K(C', C_{r'} - C')$ is a weighted average of the $K(C', C'')$ over
the sets $C'' \subseteq C_{r'} - C'$ in our current clustering and so at least one such $C''$ must satisfy $K(C', C'') >
K(C', C)$. However, this contradicts the specification of the algorithm, since by definition it merges the
pair $C$, $C'$ such that $K(C', C)$ is greatest.

Second, assume that $C$ is strictly contained in one of the ground-truth clusters $C_r$. Then, by the
weak stability property, either $K(C, C_r - C) > K(C, C')$ or $K(C', C_{r'} - C') > K(C, C')$. This again
contradicts the specification of the algorithm as in the previous case.

Finally assume that $C$ is a union of clusters in the ground-truth $C_1, \ldots, C_{k'}$. Then by definition,
$K(C', C_{r'} - C') > K(C', C_i)$, for $i = 1, \ldots, k'$, and so $K(C', C_{r'} - C') > K(C', C)$. This again
leads to a contradiction as argued above. ■

While natural, Properties 5 and 6 are still somewhat brittle: in the example of Figure 4.1, for instance,
if one adds a small number of edges with similarity 1 between the natural clusters, then the properties are
no longer satisfied for them (because pairs of elements connected by these edges will want to defect). We
can make the properties more robust by requiring that stability hold only for large sets. This will break
the average-linkage algorithm used above, but we can show that a more involved algorithm building on
the approach used in Section 4.4 will nonetheless find an approximately correct tree. For simplicity, we
focus on broadening the strong stability property, as follows (one should view $s$ as small compared to $\epsilon/k$
in this definition):

Property 7 The similarity function $K$ satisfies the $(s, \gamma)$-strong stability of large subsets property for
the clustering problem $(S, \ell)$ if for all clusters $C_r$, $C_{r'}$, $r \neq r'$ in the ground-truth, for all $A \subset C_r$,




$A' \subseteq C_{r'}$ with $|A| + |A'| \ge sn$ we have

$K(A, C_r \setminus A) > K(A, A') + \gamma$.

The idea of how we can use this property is we will first run an algorithm for the list model much like
Algorithm 2, viewing its output as simply a long list of candidate clusters (rather than clusterings). In
particular, we will get a list $\mathcal{L}$ of $k^{O\left(\frac{k}{\gamma^2} \log\frac{1}{\epsilon} \log\frac{k}{\delta f}\right)}$ clusters such that with probability at least $1 - \delta$ any
cluster in the ground-truth of size at least $\frac{\epsilon}{4k}$ is close to one of the clusters in the list. We then run a second
“tester” algorithm that is able to throw away candidates that are sufficiently non-laminar with respect to
the correct clustering and assembles the ones that remain into a tree. We present and analyze the tester
algorithm, Algorithm 4, below.


Algorithm 4 Testing Based Algorithm, Tree Model.

Input: Data set $S$, similarity function $K$, parameters $\gamma > 0$, $k \in Z^+$, $f, g, s, \alpha > 0$. A list of
clusters $\mathcal{L}$ with the property that any cluster $C$ in the ground-truth is at least $f$-close to one of them.
Output: A tree on subsets.

1. Throw out all clusters of size at most $\alpha n$. For every pair of clusters $C$, $C'$ in our list $\mathcal{L}$ of clusters
that are sufficiently “non-laminar” with respect to each other in that $|C \setminus C'| \ge gn$, $|C' \setminus C| \ge gn$
and $|C \cap C'| \ge gn$, compute $K(C \cap C', C \setminus C')$ and $K(C \cap C', C' \setminus C)$. Throw out whichever
one does worse: i.e., throw out $C$ if the first similarity is smaller, else throw out $C'$. Let $\mathcal{L}'$ be the
remaining list of clusters at the end of the process.

2. Greedily sparsify the list $\mathcal{L}'$ so that no two clusters are approximately equal (that is, choose a
cluster, throw out all that are approximately equal to it, and repeat). We say two clusters $C$, $C'$ are
approximately equal if $|C \setminus C'| \le gn$, $|C' \setminus C| \le gn$ and $|C' \cap C| \ge gn$. Let $\mathcal{L}''$ be the list
remaining.

3. Construct a forest on the remaining list $\mathcal{L}''$. $C$ becomes a child of $C'$ in this forest if $C'$
approximately contains $C$, i.e. $|C \setminus C'| \le gn$, $|C' \setminus C| \ge gn$ and $|C' \cap C| \ge gn$.

4.  Complete  the  forest  arbitrarily  into  a  tree. 

Theorem 4.5.3 Let $K$ be a similarity function satisfying $(s, \gamma)$-strong stability of large subsets for the
clustering problem $(S, \ell)$. Let $\mathcal{L}$ be a list of clusters such that any cluster in the ground-truth of size at
least $\alpha n$ is $f$-close to one of the clusters in the list. Then Algorithm 4 with parameters satisfying $s + f \le g$,
$f \le g\gamma/10$ and $\alpha > 6kg$ yields a tree such that the ground-truth clustering is $2\alpha k$-close to a pruning of
this tree.

Proof: Let $k'$ be the number of “big” ground-truth clusters: the clusters of size at least $\alpha n$; without
loss of generality assume that $C_1, \ldots, C_{k'}$ are the big clusters.

Let $C'_1, \ldots, C'_{k'}$ be clusters in $\mathcal{L}$ such that $d(C_i, C'_i)$ is at most $f$ for all $i$. By Property 7 and Lemma 4.5.4
(stated below), we know that after Step 1 (the “testing of clusters” step) all the clusters $C'_1, \ldots, C'_{k'}$ survive;
furthermore, we have three types of relations between the remaining clusters. Specifically, either:

(a) $C$ and $C'$ are approximately equal; that means $|C \setminus C'| \le gn$, $|C' \setminus C| \le gn$ and $|C' \cap C| \ge gn$.

(b) $C$ and $C'$ are approximately disjoint; that means $|C \setminus C'| \ge gn$, $|C' \setminus C| \ge gn$ and $|C' \cap C| \le gn$.

(c) or $C'$ approximately contains $C$; that means $|C \setminus C'| \le gn$, $|C' \setminus C| \ge gn$ and $|C' \cap C| \ge gn$.

Let $\mathcal{L}''$ be the remaining list of clusters after sparsification. It is easy to show that there exist $C''_1, \ldots,
C''_{k'}$ in $\mathcal{L}''$ such that $d(C_i, C''_i)$ is at most $(f + 2g)$, for all $i$. Moreover, all the elements in $\mathcal{L}''$ are either in the
relation “subset” or “disjoint”. Also, since all the clusters $C_1, \ldots, C_{k'}$ have size at least $\alpha n$, we also have




that $C''_i$, $C''_j$ are in the relation “disjoint”, for all $i, j$, $i \neq j$. That is, in the forest we construct $C''_i$ are not
descendants of one another.

We show $C''_1, \ldots, C''_{k'}$ are part of a pruning of small error rate of the final tree. We do so by exhibiting
a small extension to a list of clusters $\mathcal{L}'''$ that are all approximately disjoint and nothing else in $\mathcal{L}''$ is
approximately disjoint from any of the clusters in $\mathcal{L}'''$ (thus $\mathcal{L}'''$ will be the desired pruning). Specifically
greedily pick a cluster $\tilde{C}_1$ in $\mathcal{L}''$ that is approximately disjoint from $C''_1, \ldots, C''_{k'}$, and in general in step
$i > 1$ greedily pick a cluster $\tilde{C}_i$ in $\mathcal{L}''$ that is approximately disjoint from $C''_1, \ldots, C''_{k'}, \tilde{C}_1, \ldots, \tilde{C}_{i-1}$. Let
$C''_1, \ldots, C''_{k'}, \tilde{C}_1, \ldots, \tilde{C}_{\tilde{k}}$ be the list $\mathcal{L}'''$. By design, $\mathcal{L}'''$ will be a pruning of the final tree and we now claim
its total error is at most $2\alpha k n$. In particular, note that the total number of points missing from $C''_1, \ldots, C''_{k'}$
is at most $k(f + 2g)n + k\alpha n \le \frac{3}{2} k \alpha n$. Also, by construction, each $\tilde{C}_i$ must contain at least $\alpha n - (k + i)gn$
new points, which together with the above implies that $\tilde{k} \le 2k$. Thus, the total error of $\mathcal{L}'''$ overall is at
most $\frac{3}{2} \alpha k n + 2 k k' g n \le 2 \alpha k n$. ■

Lemma 4.5.4 Let $K$ be a similarity function satisfying the $(s, \gamma)$-strong stability of large subsets property
for the clustering problem $(S, \ell)$. Let $C$, $C'$ be such that $|C \cap C'| \ge gn$, $|C \setminus C'| \ge gn$ and $|C' \setminus C| \ge gn$.
Let $C^*$ be a cluster in the underlying ground-truth such that $|C^* \setminus C| \le fn$ and $|C \setminus C^*| \le fn$. Let
$I = C \cap C'$. If $s + f \le g$ and $f \le g\gamma/10$, then $K(I, C \setminus I) > K(I, C' \setminus I)$.

Proof: Let $I^* = I \cap C^*$. So, $I^* = C \cap C' \cap C^*$. We prove first that

$$K(I, C \setminus I) > K(I^*, C^* \setminus I^*) - \gamma/2. \qquad (4.1)$$


Since  K{x,  x')  >  —1,  we  have 

K{J,  c  \  /)  >  (1  -  pi)K{j  nc*,{C\J)n  c*)  -  pi, 

where  1  —  pi  =  ^  By  assumption  we  have  |/|  >  gn,  and  also  |/  \  /*|  <  fn.  That  means 

\n  ^  \i\-\i\n  >  Similarly,  |C  \  /|  >  gn  and  |(C  \  /)  n  C*|  <  |C  \  C*|  <  fn.  So, 

I(c\j)nc*|  _  |c\/|-|(c\/)nc*|  ^  ff-/ 

\C\I\  -  9  ■ 

Let  us  denote  by  1  —  p  the  quantity  ■  We  have: 

K{I,  c  \  I)  >  (1  -  p)K{r,  (C  \  /)  n  C*)  -  p.  (4.2) 

Let  A  =  (C*  \  I*)  n  C  and  S  =  (C*  \  /*)  n  C.  We  have 

K{J*,C*\I*)  =  (I  -  a)K(I* ,  A)  -  aK(L* ,  B),  (4.3) 

where  1  —  a  =  Note  that 

A  =  {c*\i*)nc  =  (C*  n C)  \  (r  n C)  =  (C*  n C)  \ r 

and 

(C  \  I)  n  c*  =  (C  n  c*)  \  (/  n  c*)  =  (C*  n  C)  \  r , 

so  A  =  (C  \  I)  n  C*.  Furthermore 

|(c\/) nc*|  =  |(c\c')\(c\(c'nc*))|  >  |c\c'|-l<^\(<^'nc*)|  >  ICVC'I-ICV*^*!  >  gn-fn. 


92 


We  also  have  \B\  =  \{C*  \  I*)  n  C]  >  \C*  \  C\.  These  imply  that  1  —  a  =  |^[^jg|  = 
and  furthermore  =  — 1  +  Equation  (4.3)  implies 

K(r,A)  =  -^K(r,c*\r)  — —aiK(r,B) 

1  —  a  1  —  ai 

and  since  K(x,  x')  <  1,  we  obtain: 

f 

K(r,A)  >  K(r,c*  \r)  — .  (4.4) 

g- 1 


Overall,  combining  (4.2)  and  (4.4)  we  obtain:  K{I,C  \  I)  >  {I  —  p)  K{I* ,C*  \  I*)  — 

/ 


—  p,  so 


K{i,c\i)  >  K{r,c*\r)-2p-{i-p) 


g-  f 


Weprove  now  that  2p+(l— <  7/2,  which  finally  implies  relation  (4.1 ).  Since  1—p  =  (  )  ,  we 


9-f 


-  I  9-f 


havep=  so2p+(l-p)^  =  ^^^2 - r  ^^2 


+  =5/_2 


< 


^_2M=£  +  /k^  =  4/_2 

9  9  9 

7/2,  since  by  assumption  /  <  57/ 10. 

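Our assumption that $K$ is a similarity function satisfying the strong stability property with a threshold
$sn$ and a $\gamma$-gap for our clustering problem $(S, \ell)$, together with the assumption $s + f \le g$ implies

$$K(I^*, C^* \setminus I^*) \ge K(I^*, C' \setminus (I^* \cup C^*)) + \gamma. \tag{4.5}$$

We finally prove that

$$K(I^*, C' \setminus (I^* \cup C^*)) \ge K(I, C' \setminus I) - \gamma/2. \tag{4.6}$$

The proof is similar to the proof of statement (4.1). First note that

$$K(I, C' \setminus I) \le (1 - p_2) K(I^*, (C' \setminus I) \cap \overline{C^*}) + p_2,$$

where $1 - p_2 = \frac{|I^*|}{|I|} \cdot \frac{|(C' \setminus I) \cap \overline{C^*}|}{|C' \setminus I|}$.

We know from above that $\frac{|I^*|}{|I|} \ge \frac{g - f}{g}$, and we can also show
$\frac{|(C' \setminus I) \cap \overline{C^*}|}{|C' \setminus I|} \ge \frac{g - f}{g}$. So $1 - p_2 \ge \left(\frac{g - f}{g}\right)^2$, and hence $p_2 \le 2\frac{f}{g} < \gamma/2$, as desired.

To complete the proof note that relations (4.1), (4.5) and (4.6) together imply the desired result, namely
that $K(I, C \setminus I) > K(I, C' \setminus I)$. ■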
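
The inequality $2p + (1-p)\frac{f}{g-f} < \gamma/2$ used in the proof of Lemma 4.5.4 can be sanity-checked with exact rational arithmetic. The sketch below is only an illustration: the helper name `slack_term` and the sample values of $g$ and $\gamma$ are assumptions chosen for the check, and $f$ is set to the largest value the hypothesis $f \le g\gamma/10$ allows.

```python
from fractions import Fraction

def slack_term(f, g):
    """Return 2p + (1 - p) * f / (g - f), where 1 - p = ((g - f) / g)^2."""
    p = 1 - ((g - f) / g) ** 2
    return 2 * p + (1 - p) * f / (g - f)

gamma = Fraction(1, 2)   # illustrative gap (assumption)
g = Fraction(1, 100)     # illustrative overlap parameter (assumption)
f = g * gamma / 10       # largest f allowed by the hypothesis f <= g*gamma/10

value = slack_term(f, g)
closed_form = 5 * f / g - 3 * (f / g) ** 2   # simplified form of the same quantity
print(value, closed_form, value < gamma / 2)
```

Since $f/g \le \gamma/10$, the leading term $5f/g$ is at most $\gamma/2$, and the negative quadratic term makes the inequality strict.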


Theorem 4.5.5 Let $K$ be a similarity function satisfying the $(s, \gamma)$-strong stability of large subsets property
for the clustering problem $(S, \ell)$. Assume that $s = O(\epsilon^2\gamma/k^2)$. Then using Algorithm 4 with parameters
$\alpha = O(\epsilon/k)$, $g = O(\epsilon^2/k^2)$, $f = O(\epsilon^2\gamma/k^2)$, together with Algorithm 2 we can with probability
$1 - \delta$ produce a tree with the property that the ground-truth is $\epsilon$-close to a pruning of this tree. Moreover,
the size of this tree is $O(k/\epsilon)$.

Proof: First, we run Algorithm 2 to get a list $\mathcal{L}$ of clusters such that with probability at least $1 - \delta$ any
cluster  in  the  ground-truth  of  size  at  least  ^  is  /-close  to  one  of  the  clusters  in  the  list.  We  can  ensure 

that  our  list  C  has  size  at  most  We  then  run  Procedure  4  with  parameters  a  =  0{e/k), 

$g = O(\epsilon^2/k^2)$, $f = O(\epsilon^2\gamma/k^2)$. We thus obtain a tree with the guarantee that the ground-truth is $\epsilon$-close
to a pruning of this tree (see Theorem 4.5.3). To complete the proof we only need to show that this tree
has $O(k/\epsilon)$ leaves. This follows from the fact that all leaves of our tree have at least $\alpha n$ points and the
overlap between any two of them is at most $gn$ (for a formal proof see Lemma 4.5.6). ■




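Lemma 4.5.6 Let $P_1, \ldots, P_s$ be a quasi-partition of $S$ such that $|P_i| \ge n\frac{\nu}{k}$ and $|P_i \cap P_j| \le gn$ for all
$i, j \in \{1, \ldots, s\}$, $i \neq j$. If $g = \frac{\nu^2}{5k^2}$, then $s \le 2\frac{k}{\nu}$.

Proof: Assume for contradiction that $s > L = 2\frac{k}{\nu}$, and consider the first $L$ parts $P_1, \ldots, P_L$. Then
$\left(n\frac{\nu}{k} - 2\frac{k}{\nu}gn\right)2\frac{k}{\nu}$ is a lower bound on the number of points that belong to exactly one of the parts $P_i$,
$i \in \{1, \ldots, L\}$. For our choice of $g$, $g = \frac{\nu^2}{5k^2}$, we have $\left(n\frac{\nu}{k} - 2\frac{k}{\nu}gn\right)2\frac{k}{\nu} = 2n - \frac{4}{5}n$. So $\frac{6}{5}n$ is a
lower bound on the number of points that belong to exactly one of the parts $P_i$, $i \in \{1, \ldots, L\}$, which is
impossible since $|S| = n$. So, we must have $s \le 2\frac{k}{\nu}$. ■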
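
The counting step in this proof can be checked with exact arithmetic. The sketch below assumes the reading of the lemma in which each part has at least $(\nu/k)n$ points, pairwise overlaps are at most $gn$, and $g = \nu^2/(5k^2)$; those symbols, the function name and the sample values of $\nu$ and $k$ are assumptions made for illustration.

```python
from fractions import Fraction

def single_membership_lower_bound(nu, k, g):
    """Lower bound, as a fraction of n, on the points lying in exactly one
    of the first L = 2k/nu parts of the quasi-partition."""
    L = 2 * k / nu                 # number of parts considered
    per_part = nu / k - L * g      # each part loses at most L*g*n points to overlaps
    return per_part * L

nu, k = Fraction(1, 4), Fraction(3)   # illustrative values (assumption)
g = nu ** 2 / (5 * k ** 2)            # overlap bound from the lemma statement
bound = single_membership_lower_bound(nu, k, g)
print(bound)
```

For any admissible $\nu$ and $k$ the bound evaluates to $6/5$, which exceeds 1 and so contradicts $|S| = n$.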

To  better  illustrate  our  properties,  we  present  a  few  interesting  examples  in  Section  4.8.4. 

4.6  Inductive  Setting 

In this section we consider an inductive model in which $S$ is merely a small random subset of points from
a much larger abstract instance space $X$, and clustering is represented implicitly through a hypothesis
$h : X \to Y$. In the list model our goal is to produce a list of hypotheses $\{h_1, \ldots, h_t\}$ such that at least
one of them has error at most $\epsilon$. In the tree model we assume that each node in the tree induces a cluster
which is implicitly represented as a function $f : X \to \{0, 1\}$. For a fixed tree $T$ and a point $x$, we define
$T(x)$ as the subset of nodes in $T$ that contain $x$ (the subset of nodes $f \in T$ with $f(x) = 1$). We say that a
tree $T$ has error at most $\epsilon$ if $T(X)$ has a pruning $f_1, \ldots, f_{k'}$ of error at most $\epsilon$.

In the following we analyze, for each of our properties, how large a set $S$ we need to see in order for
our list or tree produced with respect to $S$ to induce a good solution with respect to $X$.

The average attraction property. Our algorithms for the average attraction property (Property 3) and the
average weighted attraction property are already inherently inductive.

The strict separation property. We can adapt the algorithm in Theorem 4.3.2 to the inductive setting as
follows. We first draw a set $S$ of $n = O\left(\frac{k}{\epsilon}\ln\left(\frac{k}{\delta}\right)\right)$ unlabeled examples. We run the algorithm described
in Theorem 4.3.2 on this set and obtain a tree $T$ on the subsets of $S$. Let $Q$ be the set of leaves of this
tree. We associate with each node $u$ in $T$ a boolean function $f_u$ specified as follows. Consider $x \in X$, and let
$q(x) \in Q$ be the leaf given by $\mathrm{argmax}_{q \in Q} K(x, q)$; if $u$ appears on the path from $q(x)$ to the root, then set
$f_u(x) = 1$, otherwise set $f_u(x) = 0$.

Note that $n$ is large enough to ensure that with probability at least $1 - \delta$, $S$ includes at least a point
in each cluster of size at least $\frac{\epsilon}{k}$. Remember that $C = \{C_1, \ldots, C_k\}$ is the correct clustering of the entire
domain. Let $C_S$ be the (induced) correct clustering on our sample $S$ of size $n$. Since our property is
hereditary, Theorem 4.3.2 implies that $C_S$ is a pruning of $T$. It then follows from the specification of our
algorithm and from the definition of the strict separation property that with probability at least $1 - \delta$ the
partition induced over the whole space by this pruning is $\epsilon$-close to $C$.
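
The sample-size requirement above follows from a union bound: a cluster of probability mass at least $\epsilon/k$ is missed by $n$ independent draws with probability at most $(1 - \epsilon/k)^n$, and there are at most $k$ clusters. The sketch below is a hedged illustration of that calculation. It assumes the sample size has the form $(k/\epsilon)\ln(k/\delta)$ with hidden constant 1 and a mass threshold of $\epsilon/k$; both function names are hypothetical.

```python
import math

def sample_size(k, eps, delta):
    """n = ceil((k/eps) * ln(k/delta)); the constant 1 is an assumption."""
    return math.ceil((k / eps) * math.log(k / delta))

def miss_probability_bound(k, eps, n):
    """Union bound over at most k clusters, each of mass at least eps/k,
    on the probability that some such cluster receives no sample point."""
    return k * (1 - eps / k) ** n

k, eps, delta = 5, 0.1, 0.05
n = sample_size(k, eps, delta)
print(n, miss_probability_bound(k, eps, n))
```

Because $(1 - \epsilon/k)^n \le e^{-n\epsilon/k} \le \delta/k$ for this choice of $n$, the printed bound stays below $\delta$.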

The strong stability of large subsets property. We can also naturally extend the algorithm for Property 7
to the inductive setting. The main difference in the inductive setting is that we have to estimate (rather
than compute) the quantities $|C_r \setminus C_{r'}|$, $|C_{r'} \setminus C_r|$, $|C_r \cap C_{r'}|$, $K(C_r \cap C_{r'}, C_r \setminus C_{r'})$ and $K(C_r \cap C_{r'}, C_{r'} \setminus C_r)$
for any two clusters $C_r$, $C_{r'}$ in the list $\mathcal{L}$. We can easily do that with only $\mathrm{poly}(k, 1/\epsilon, 1/\gamma, 1/\delta)\log(|\mathcal{L}|)$
additional points, where $\mathcal{L}$ is the input list in Algorithm 4 (whose size depends on $1/\epsilon$, $1/\gamma$ and $k$ only).
Specifically, using a modification of the proof in Theorem 4.5.5 and standard concentration inequalities
(e.g. the McDiarmid inequality [103]) we can show that:

Theorem 4.6.1 Assume that $K$ is a similarity function satisfying the $(s, \gamma)$-strong stability of large subsets
property for $(X, \ell)$. Assume that $s = O(\epsilon^2\gamma/k^2)$. Then using Algorithm 4 with parameters $\alpha = O(\epsilon/k)$,
$g = O(\epsilon^2/k^2)$, $f = O(\epsilon^2\gamma/k^2)$, together with Algorithm 2 we can produce a tree with the property that
the ground-truth is $\epsilon$-close to a pruning of this tree. Moreover, the size of this tree is $O(k/\epsilon)$. We use

0(;^ln(^)  •  (f)^  points  in  the  first  phase  and  log  ^  log log  k)  points  in 

the  second  phase. 

Note  that  each  cluster  is  represented  as  a  nearest  neighbor  hypothesis  over  at  most  k  sets. 

The strong stability property. We first note that we need to consider a variant of our property that has
a $\gamma$-gap. To see why this is necessary consider the following example. Suppose all $K(x, x')$ values are
equal to $1/2$, except for a special single center point $x_i$ in each cluster $C_i$ with $K(x_i, x) = 1$ for all $x$ in
$C_i$. This satisfies strong stability since for every $A \subset C_i$ we have that $K(A, C_i \setminus A)$ is strictly larger than $1/2$.
Yet it is impossible to cluster in the inductive model because our sample is unlikely to contain the center
points. The variant of our property that is suited to the inductive setting is the following:

Property 8 The similarity function $K$ satisfies the $\gamma$-strong stability property for the clustering problem
$(X, \ell)$ if for all clusters $C_r$, $C_{r'}$, $r \neq r'$ in the ground-truth, for all $A \subset C_r$, for all $A' \subseteq C_{r'}$ we have

$$K(A, C_r \setminus A) > K(A, A') + \gamma.$$

For this property, we could always run the algorithm for Theorem 4.6.1, though running time would
be exponential in $k$ and $1/\gamma$. We show here how we can get polynomial dependence on these parameters
by adapting Algorithm 3 to the inductive setting as in the case of the strict order property. Specifically, we
first draw a set $S$ of $n$ unlabeled examples. We run the average linkage algorithm on this set and obtain a
tree $T$ on the subsets of $S$. We then attach each new point $x$ to its most similar leaf in this tree as well as
to the set of nodes on the path from that leaf to the root. For a formal description see Algorithm 5. While
this algorithm looks natural, proving its correctness requires more involved arguments.


Algorithm 5 Inductive Average Linkage, Tree Model

Input: Similarity function $K$, parameters $\gamma, \epsilon > 0$, $k \in \mathbb{Z}^+$; $n = n(\epsilon, \gamma, k, \delta)$;

•  Pick a set $S = \{x_1, \ldots, x_n\}$ of $n$ random examples from $X$.

•  Run the average linkage algorithm (Algorithm 3) on the set $S$ and obtain a tree $T$ on the subsets of
$S$. Let $Q$ be the set of leaves of this tree.

•  Associate with each node $u$ in $T$ a function $f_u$ (which induces a cluster) specified as follows.

Consider $x \in X$, and let $q(x) \in Q$ be the leaf given by $\mathrm{argmax}_{q \in Q} K(x, q)$; if $u$ appears on the
path from $q(x)$ to the root, then set $f_u(x) = 1$, otherwise set $f_u(x) = 0$.

•  Output the tree $T$.
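
The steps of Algorithm 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: tree nodes are represented as frozensets of sample points, leaves are singletons, sample points are assumed distinct and hashable, ties are broken arbitrarily, and the names `avg_sim`, `average_linkage` and `induced_nodes` are hypothetical.

```python
from itertools import combinations

def avg_sim(K, A, B):
    """Average similarity K(A, B) between two finite point sets."""
    return sum(K(a, b) for a in A for b in B) / (len(A) * len(B))

def average_linkage(K, sample):
    """Build the average-linkage tree on the sample.
    Returns (leaves, parent); nodes are frozensets of sample points."""
    leaves = [frozenset([x]) for x in sample]
    active = list(leaves)
    parent = {}
    while len(active) > 1:
        # Merge the pair of current clusters with the highest average similarity.
        a, b = max(combinations(active, 2), key=lambda p: avg_sim(K, p[0], p[1]))
        merged = a | b
        parent[a] = parent[b] = merged
        active = [c for c in active if c not in (a, b)] + [merged]
    return leaves, parent

def induced_nodes(K, leaves, parent, x):
    """Nodes u with f_u(x) = 1: the most similar leaf q(x) and all its ancestors."""
    node = max(leaves, key=lambda q: avg_sim(K, [x], q))
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

K = lambda a, b: -abs(a - b)          # toy similarity on the real line (assumption)
sample = [0.0, 0.1, 5.0, 5.2]
leaves, parent = average_linkage(K, sample)
path = induced_nodes(K, leaves, parent, 4.8)
print(path)
```

In this toy run the new point 4.8 attaches to the leaf {5.0}, so it is assigned to that leaf, to the node {5.0, 5.2}, and to the root.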


We show in the following that for $n = \mathrm{poly}(k, 1/\epsilon, 1/\gamma, 1/\delta)$ we obtain a tree $T$ which has a pruning
$f_1, \ldots, f_{k'}$ of error at most $\epsilon$. Specifically:

Theorem 4.6.2 Let $K$ be a similarity function satisfying the strong stability property for the clustering
problem $(X, \ell)$. Then using Algorithm 5 with parameters $n = \mathrm{poly}(k, 1/\epsilon, 1/\gamma, 1/\delta)$, we can produce a
tree with the property that the ground-truth is $\epsilon$-close to a pruning of this tree.

Proof: Remember that $C = \{C_1, \ldots, C_k\}$ is the ground-truth clustering of the entire domain. Let
$C_S = \{C'_1, \ldots, C'_k\}$ be the (induced) correct clustering on our sample $S$ of size $n$. As in the previous
arguments  we  assume  that  a  cluster  is  big  if  it  has  probability  mass  at  least 

First, Theorem 4.6.3 below implies that with high probability the clusters $C'_i$ corresponding to the large
ground-truth clusters satisfy our property with a gap $\gamma/2$. (Just perform a union bound over $x \in S \setminus C'_i$.)
It may be that the clusters $C'_i$ corresponding to the small ground-truth clusters do not satisfy the property. However,
a careful analysis of the argument in Theorem 4.5.1 shows that with high probability $C_S$ is a pruning
of the tree $T$. Furthermore, since $n$ is large enough we also have that with high probability $K(x, C(x))$ is
within $\gamma/2$ of $K(x, C'(x))$ for a $1 - \epsilon$ fraction of points $x$. This ensures that with high probability, for any
such good $x$ the leaf $q(x)$ belongs to $C(x)$. This finally implies that the partition induced over the whole
space by the pruning $C_S$ of the tree $T$ is $\epsilon$-close to $C$. ■

Note that each cluster $u$ is implicitly represented by the function $f_u$ defined in the description of
Algorithm 5.

We prove in the following that for a sufficiently large value of $n$ sampling preserves stability. Specifically:

Theorem 4.6.3 Let $C_1, C_2, \ldots, C_k$ be a partition of a set $X$ such that for any $A \subseteq C_i$ and any $x \notin C_i$,

$$K(A, C_i \setminus A) \ge K(A, x) + \gamma.$$

Let $x \notin C_i$ and let $C'_i$ be a random subset of $n'$ elements of $C_i$. Then, $n' = \mathrm{poly}(1/\gamma, \log(1/\delta))$ is
sufficient so that with probability $1 - \delta$, for any $A \subset C'_i$,

$$K(A, C'_i \setminus A) \ge K(A, x) + \frac{\gamma}{2}.$$

Proof: First of all, the claim holds for singleton subsets $A$ with high probability using a Chernoff bound.
This implies the condition is also satisfied for every subset $A$ of size at most $\gamma n'/2$. Thus, it remains
to prove the claim for large subsets. We do this using the cut-decomposition of [113] and the random
sampling analysis of [14].

Let $N = |C_i|$. By [113], we can decompose the similarity matrix for $C_i$ into a sum of cut-matrices
$B_1 + B_2 + \cdots + B_s$ plus a low cut-norm matrix $W$ with the following properties. First, each $B_j$ is a
cut-matrix, meaning that for some subset $S_{j1}$ of the rows and subset $S_{j2}$ of the columns and some value
$d_j$, we have: $B_j[xy] = d_j$ for $x \in S_{j1}$, $y \in S_{j2}$ and all $B_j[xy] = 0$ otherwise. Second, each $d_j = O(1)$.
Finally, $s = 1/\epsilon^2$ cut-matrices are sufficient so that matrix $W$ has cut-norm at most $\epsilon N^2$: that is, for
any partition of the vertices $A$, $A'$, we have $\left|\sum_{x \in A, y \in A'} W[xy]\right| \le \epsilon N^2$; moreover, $\|W\|_\infty \le 1/\epsilon$ and
$\|W\|_F \le N$.
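
A small sketch may help show why cut-matrices are convenient: the sum of a cut-matrix over any block $A \times A'$ depends only on the intersection sizes $|A \cap S_{j1}|$ and $|A' \cap S_{j2}|$. The code below is an illustration with made-up index sets; `cut_matrix` and `block_sum` are hypothetical helper names.

```python
def cut_matrix(n, rows, cols, d):
    """n x n matrix equal to d on rows x cols and 0 elsewhere."""
    return [[d if (x in rows and y in cols) else 0 for y in range(n)]
            for x in range(n)]

def block_sum(M, A, A2):
    """Sum of the entries M[x][y] over x in A, y in A2."""
    return sum(M[x][y] for x in A for y in A2)

n, d = 6, 0.5
S1, S2 = {0, 1, 2}, {2, 3, 4}          # row and column supports of the cut-matrix
B = cut_matrix(n, S1, S2, d)

A, A2 = {0, 1, 5}, {2, 3}              # an arbitrary block
lhs = block_sum(B, A, A2)
rhs = d * len(A & S1) * len(A2 & S2)   # depends only on intersection sizes
print(lhs, rhs)
```

Both quantities agree, which is the sense in which the intersection sizes determine the average similarity once the similarity matrix is a sum of cut-matrices.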

We now closely follow arguments in [14]. First, let us imagine that we have exact equality $C_i = B_1 + \ldots + B_s$,
and we will add in the matrix $W$ later. We are given that for all $A$, $K(A, C_i \setminus A) \ge K(A, x) + \gamma$.
In particular, this trivially means that for each “profile” of sizes $\{t_{jr}\}$, there is no set $A$ satisfying

$$|A \cap S_{jr}| \in [t_{jr} - \alpha,\, t_{jr} + \alpha]N$$

$$|A| \ge (\gamma/4)N$$

that violates our given condition. The reason for considering cut-matrices is that the values $|A \cap S_{jr}|$
completely determine the quantity $K(A, C_i \setminus A)$. We now set $\alpha$ so that the above constraints determine
$K(A, C_i \setminus A)$ up to $\pm\gamma/4$. In particular, choosing $\alpha = o(\gamma^2/s)$ suffices. This means that fixing a profile
of values $\{t_{jr}\}$, we can replace “violates our given condition” with $K(A, x) \ge c_0$ for some value $c_0$
depending on the profile, losing only an amount $\gamma/4$. We now apply Theorem 9 (random sub-programs of
LPs) of [14]. This theorem states that with probability $1 - \delta$, in the subgraph $C'_i$, there is no set $A'$ satisfying
the above inequalities where the right-hand-sides and objective $c_0$ are reduced by $O(\sqrt{\log(1/\delta)}/\sqrt{n})$.
Choosing $n \gg \log(1/\delta)/\alpha^2$ we get that with high probability the induced cut-matrices $B'_j$ have the
property that there is no $A'$ satisfying

$$|A' \cap S'_{jr}| \in [t_{jr} - \alpha/2,\, t_{jr} + \alpha/2]N$$

$$|A'| \ge (\gamma/2)n'$$

with the objective value $c_0$ reduced by at most $\gamma/4$. We now simply do a union-bound over all possible
profiles $\{t_{jr}\}$ consisting of multiples of $\alpha$ to complete the argument.

Finally, we incorporate the additional matrix $W$ using the following result from [14].

Lemma 4.6.4 [14][Random submatrix] For $\epsilon, \delta > 0$, and any $W$ an $N \times N$ real matrix with cut-norm
$\|W\|_C \le \epsilon N^2$, $\|W\|_\infty \le 1/\epsilon$ and $\|W\|_F \le N$, let $S'$ be a random subset of the rows of $W$ with
$n' = |S'|$ and let $W'$ be the $n' \times n'$ submatrix of $W$ corresponding to $S'$. For $n' > (c_1/\epsilon^4\delta^5)\log(2/\epsilon)$,
with probability at least $1 - \delta$,

$$\|W'\|_C \le c_2\frac{\epsilon}{\sqrt{\delta}}n'^2$$

where $c_1$, $c_2$ are absolute constants.

We want the addition of $W'$ to influence the values $K(A, C'_i - A)$ by $o(\gamma)$. We now use the fact that we
only care about the case that $|A| \ge \gamma n'/2$ and $|C'_i - A| \ge \gamma n'/2$, so that it suffices to affect the sum
$\sum_{x \in A, y \in C'_i - A} K(x, y)$ by $o(\gamma^2 n'^2)$. In particular, this means it suffices to have $\epsilon = o(\gamma^2)$, or equivalently
$s = O(1/\gamma^4)$. This in turn implies that it suffices to have $\alpha = o(\gamma^6)$, which implies that $n' = O(1/\gamma^{12})$
suffices for the theorem. ■
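
The chain of parameter choices at the end of this proof can be written out explicitly. The display below is a bookkeeping sketch that ignores constants, lower-order factors, and the distinction between $o(\cdot)$ and $O(\cdot)$; it uses only the relations stated in the proof ($s = 1/\epsilon^2$, $\alpha = o(\gamma^2/s)$, and a sample size of order $\log(1/\delta)/\alpha^2$).

```latex
\begin{align*}
\epsilon \approx \gamma^{2}
  &\;\Longrightarrow\; s = 1/\epsilon^{2} \approx \gamma^{-4},\\
\alpha \approx \gamma^{2}/s
  &\;\Longrightarrow\; \alpha \approx \gamma^{2}\cdot\gamma^{4} = \gamma^{6},\\
n' \approx \log(1/\delta)/\alpha^{2}
  &\;\Longrightarrow\; n' \approx \gamma^{-12}\log(1/\delta).
\end{align*}
```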


4.7  Approximation  Assumptions 

When developing a $c$-approximation algorithm for some clustering objective function $F$, if the goal is
to actually get the points correct, then one is implicitly making the assumption (or hope) that any
$c$-approximation to $F$ must be $\epsilon$-close in symmetric difference to the target clustering. We show here
how assumptions of this kind can be viewed as special cases of the $\nu$-strict separation property.

Property 9 Given objective function $F$, we say that a metric $d$ over point set $S$ satisfies the $(c, \epsilon)$-$F$
property with respect to target $C$ if all clusterings $C'$ that are within a factor $c$ of optimal in terms of
objective $F$ are $\epsilon$-close to $C$.

We now consider in particular the $k$-median and $k$-center objective functions.

Theorem 4.7.1 If metric $d$ satisfies the $(2, \epsilon)$-$k$-median property for dataset $S$, then the similarity function
$-d$ satisfies the $\nu$-strict separation property for $\nu = 4\epsilon$.

Proof: Let $C = C_1, C_2, \ldots, C_k$ be the target clustering and let $OPT = \{OPT_1, OPT_2, \ldots, OPT_k\}$ be the
$k$-median optimal clustering, where $\sum_i |C_i \cap OPT_i| \ge (1 - \epsilon)n$. Let us mark the set of at most $\epsilon n$
points where $C$ and $OPT$ disagree.

We now repeat the following step: if there exists an unmarked $x_j$ that is more similar to some unmarked $z_j$ in a
different cluster than to some unmarked $y_j$ in its own cluster, we mark all three points. If this process halts after at most $\epsilon n$
rounds, then we are happy: the unmarked set, which has at least $(1 - 4\epsilon)n$ points, satisfies strict separation.
We now claim we can get a contradiction if the process lasts longer. Specifically, begin with $OPT$ (not
$C$) and move each $x_j$ to the cluster containing point $z_j$. Call the result $OPT'$. Note that for all $j$, the pair
$(x_j, y_j)$ are in the same cluster in $C$ (because we only chose from unmarked points where $C$ and $OPT$
agree) but are in different clusters in $OPT'$. So, $\mathrm{dist}(OPT', C) > \epsilon n$. However, $OPT'$ has cost at most
$2\,OPT$; to see this note that moving $x_j$ into the cluster of the corresponding $z_j$ will increase the $k$-median
objective by at most $\mathrm{cost}'(x_j) \le d(x_j, z_j) + \mathrm{cost}(z_j) \le d(x_j, y_j) + \mathrm{cost}(z_j) \le \mathrm{cost}(x_j) + \mathrm{cost}(y_j) + \mathrm{cost}(z_j)$.
Thus, the $k$-median objective at most doubles, i.e., $\mathrm{cost}'(OPT') \le 2\,\mathrm{cost}(OPT)$, contradicting
our initial assumption. ■
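
The marking process used in this proof can be sketched as follows. This is a brute-force illustration (cubic time per round), assuming points are hashable, `label` gives the cluster of each point, and similarity is the negated distance so that "more similar" means "closer"; the function name `mark_violations` and the toy data are hypothetical.

```python
def mark_violations(points, label, d, marked):
    """Greedily mark triples (x, y, z) where x is closer to z in another
    cluster than to y in its own cluster. Returns (marked set, rounds)."""
    marked = set(marked)
    rounds = 0
    while True:
        free = [p for p in points if p not in marked]
        triple = next(
            ((x, y, z)
             for x in free for y in free for z in free
             if x != y and label[x] == label[y] and label[x] != label[z]
             and d(x, z) < d(x, y)),
            None,
        )
        if triple is None:
            return marked, rounds      # unmarked points satisfy strict separation
        marked.update(triple)
        rounds += 1

points = [0, 1, 2, 9, 10, 11, 12]
label = {0: "A", 1: "A", 2: "A", 9: "A", 10: "B", 11: "B", 12: "B"}
d = lambda a, b: abs(a - b)
marked, rounds = mark_violations(points, label, d, set())
print(marked, rounds)
```

Here the point 9 carries label A but sits next to cluster B, so one round marks it together with one witness from each cluster; the remaining points satisfy strict separation.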

We  can  similarly  prove: 




Theorem 4.7.2 If the metric $d$ satisfies the $(3, \epsilon)$-$k$-center property, then the similarity function $-d$
satisfies the $\nu$-strict separation property for $\nu = 4\epsilon$.

So if the metric $d$ satisfies the $(2, \epsilon)$-$k$-median or the $(3, \epsilon)$-$k$-center property for dataset $S$, then
the similarity function $-d$ satisfies the $\nu$-strict separation property for $\nu = 4\epsilon$. Theorem 4.3.3 (in
Section 4.7.1) then implies that as long as the smallest cluster in the target has size $20\epsilon n$ we can produce a
tree such that the ground-truth clustering is $4\epsilon$-close to a pruning of this tree.

Note: In fact, both the $(2, \epsilon)$-$k$-median property and the $(2, \epsilon)$-$k$-means property are quite a bit more
restrictive than $\nu$-strict separation. They imply, for instance, that except for an $O(\epsilon)$ fraction of “bad”
points, there exists $d$ such that all points in the same cluster have distance much less than $d$ and all points
in different clusters have distance much greater than $d$. In contrast, $\nu$-strict separation would allow for
different distance scales at different parts of the graph.

We have further exploited this in recent work [42]. Specifically, in [42] we show that if we assume that
any $c$-approximation to the $k$-median objective is $\epsilon$-close to the target, then we can produce clusterings
that are $O(\epsilon)$-close to the target, even for values $c$ for which obtaining a $c$-approximation is NP-hard.

In particular, the main results of [42] for the $k$-median objective are the following:

Theorem 4.7.3 If metric $d$ satisfies the $(1 + \alpha, \epsilon)$-$k$-median property for dataset $S$ and each cluster in the
target clustering has size at least $(4 + 15/\alpha)\epsilon n + 2$, then we can efficiently find a clustering that is $\epsilon$-close
to the target.

Theorem 4.7.4 If metric $d$ satisfies the $(1 + \alpha, \epsilon)$-$k$-median property for dataset $S$, then we can efficiently
find a clustering which is $O(\epsilon/\alpha)$-close to the target.

These results also highlight a somewhat surprising conceptual difference between assuming that the
optimal solution to the $k$-median objective is $\epsilon$-close to the target, and assuming that any approximately
optimal solution is $\epsilon$-close to the target, even for approximation factor say $c = 1.01$. In the former case,
the problem of finding a solution that is $O(\epsilon)$-close to the target remains computationally hard, and yet for
the latter we have an efficient algorithm.

We also prove in [42] similar results for the $k$-means and min-sum properties.

4.7.1  The $\nu$-strict separation Property

We end this section by proving Theorem 4.3.3.

Theorem 4.3.3 If $K$ satisfies $\nu$-strict separation, then so long as the smallest correct cluster has size
greater than $5\nu n$, we can produce a tree such that the ground-truth clustering is $\nu$-close to a pruning of
this tree.

Proof: Let $S' \subseteq S$ be the set of $(1 - \nu)n$ points such that $K$ satisfies strict separation with respect
to $S'$. Call the points in $S'$ “good”, and those not in $S'$ “bad” (of course, goodness is not known to the
algorithm). We first generate a list $\mathcal{L}$ of $n^2$ clusters such that, ignoring bad points, any cluster in the
ground-truth is in the list. We can do this by creating, for each point $x \in S$, a cluster of the $t$ nearest points
to it for each $4\nu n \le t \le n$.

We next run a procedure that removes points from clusters that are non-laminar with respect to each
other without hurting any of the correct clusters, until the remaining set is fully laminar. Specifically, while
there exist two clusters $C$ and $C'$ that are non-laminar with respect to each other, we do the following:

1.  If either $C$ or $C'$ has size $\le 4\nu n$, delete it from the list. (By assumption, it cannot be one of the
ground-truth clusters.)

2.  If $C$ and $C'$ are “somewhat disjoint” in that $|C \setminus C'| \ge 2\nu n$ and $|C' \setminus C| \ge 2\nu n$, each point
$x \in C \cap C'$ chooses one of $C$ or $C'$ to belong to based on whichever of $C \setminus C'$ or $C' \setminus C$ respectively
has larger median similarity to $x$. We then remove $x$ from the cluster not chosen. Because each of
$C \setminus C'$ and $C' \setminus C$ has a majority of good points, if one of $C$ or $C'$ is a ground-truth cluster (with
respect to $S'$), all good points $x$ in the intersection will make the correct choice. $C$ and $C'$ are now
fully disjoint.

3.  If $C$, $C'$ are “somewhat equal” in that $|C \setminus C'| \le 2\nu n$ and $|C' \setminus C| \le 2\nu n$, we make them exactly
equal based on the following related procedure. Each point $x$ in the symmetric difference of $C$
and $C'$ decides in or out based on whether its similarity to the $(\nu n + 1)$st most-similar point in
$C \cap C'$ is larger or smaller (respectively) than its similarity to the $(\nu n + 1)$st most-similar point in
$S \setminus (C \cup C')$. If $x$ is a good point in $C \setminus C'$ and $C$ is a ground-truth cluster (with respect to $S'$), then
$x$ will correctly choose in, whereas if $C'$ is a ground-truth cluster then $x$ will correctly choose out.
Thus, we can replace $C$ and $C'$ with a single cluster consisting of their intersection plus all points $x$
that chose in, without affecting the correct clusters.

4.  If none of the other cases apply, it may still be that there exist $C$, $C'$ such that $C$ “somewhat contains”
$C'$ in that $|C \setminus C'| \ge 2\nu n$ and $0 < |C' \setminus C| \le 2\nu n$. In this case, choose the largest such $C$ and apply
the same procedure as in Step 3, but only over the points $x \in C' \setminus C$. At the end of the procedure,
we have $C \supseteq C'$ and the correct clusters have not been affected with respect to the good points.

Since  all  clusters  remaining  are  laminar,  we  can  now  arrange  them  into  a  forest,  which  we  then  arbitrarily 
complete  into  a  tree.  ■ 
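
Step 2 of the procedure above (making two "somewhat disjoint" clusters fully disjoint by median similarity) can be sketched as follows. This is an illustrative fragment only: clusters are Python sets, both set differences are assumed non-empty as Step 2 requires, ties go to the first cluster (an arbitrary assumption), and the name `split_intersection` is hypothetical.

```python
from statistics import median

def split_intersection(K, C, C2):
    """Each point in C & C2 joins the cluster whose exclusive part
    (C - C2 or C2 - C) has the larger median similarity to it."""
    only_c, only_c2 = C - C2, C2 - C
    new_c, new_c2 = set(C), set(C2)
    for x in C & C2:
        to_c = median(K(x, y) for y in only_c)
        to_c2 = median(K(x, y) for y in only_c2)
        if to_c >= to_c2:
            new_c2.discard(x)      # x stays in C only
        else:
            new_c.discard(x)       # x stays in C2 only
    return new_c, new_c2

K = lambda a, b: -abs(a - b)       # toy similarity on the real line (assumption)
C, C2 = {0, 1, 2, 3}, {3, 10, 11, 12}
new_c, new_c2 = split_intersection(K, C, C2)
print(new_c, new_c2)
```

Using the median rather than the mean is what makes the choice robust: as the proof notes, a majority of the points in each exclusive part are good, so a minority of bad points cannot move the median past the good points' similarity values.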


4.8  Other  Aspects  and  Examples 

4.8.1  Computational  Hardness  Results 

Our framework also allows us to study computational hardness results. We discuss here a simple
example.

Property 10 A similarity function $K$ satisfies the unique best cut property for the clustering problem
$(S, \ell)$ if $k = 2$ and

$$\sum_{x \in C_1, x' \in C_2} K(x, x') < \sum_{x \in A, x' \in B} K(x, x')$$

for all partitions $(A, B) \neq (C_1, C_2)$ of $S$.

Clearly, by design the clustering complexity of Property 10 is 1. However, we have the following
computational hardness result.

Theorem 4.8.1 List-clustering under the unique best cut property is NP-hard. That is, there exists $\epsilon > 0$
such that given a dataset $S$ and a similarity function $K$ satisfying the unique best cut property, it is
NP-hard to produce a polynomial-length list of clusterings such that at least one is $\epsilon$-close to the ground
truth.

Proof: It is known that the MAX-CUT problem on cubic graphs is APX-hard [12] (i.e., it is hard to
approximate within a constant factor $\alpha < 1$).

We create a family $((S, \ell), K)$ of instances for our clustering property as follows. Let $G = (V, E)$
be an instance of the MAX-CUT problem on cubic graphs, $|V| = n$. For each vertex $i \in V$ in the
graph we associate a point $x_i \in S$; for each edge $(i, j) \in E$ we define $K(x_i, x_j) = -1$, and we define
$K(x_i, x_j) = 0$ for each $(i, j) \notin E$. Let $S_{V'}$ denote the set $\{x_i : i \in V'\}$. Clearly for any given cut
$(V_1, V_2)$ in $G = (V, E)$, the value of the cut is exactly

$$F(S_{V_1}, S_{V_2}) = \sum_{x \in S_{V_1},\, x' \in S_{V_2}} -K(x, x').$$
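
The correspondence between cut values and the similarity sums can be illustrated on a small cubic graph. The sketch below uses the complete graph on four vertices (every vertex has degree 3) as a toy instance; the function names are hypothetical and vertices stand in for the points $x_i$.

```python
def similarity_from_graph(edges):
    """K(i, j) = -1 if {i, j} is an edge of G, and 0 otherwise."""
    edge_set = {frozenset(e) for e in edges}
    return lambda i, j: -1 if frozenset((i, j)) in edge_set else 0

def cut_objective(K, V1, V2):
    """F(S_V1, S_V2): sum of -K(x, x') over x in V1, x' in V2."""
    return sum(-K(i, j) for i in V1 for j in V2)

# K4 is cubic: every vertex has exactly three neighbours.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
K = similarity_from_graph(edges)
value = cut_objective(K, {0, 1}, {2, 3})
print(value)
```

The cut $(\{0,1\},\{2,3\})$ is crossed by four edges, and the objective returns exactly that count.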

Let us now add tiny perturbations to the $K$ values so that there is a unique partition $(C_1, C_2) =
(S_{V_1^*}, S_{V_2^*})$ maximizing the objective function $F$ (equivalently, minimizing $\sum_{x \in C_1, x' \in C_2} K(x, x')$), and
this partition corresponds to some max-cut $(V_1^*, V_2^*)$
of $G$ (e.g., we can do this so that this partition corresponds to the lexicographically first such cut). By
design, $K$ now satisfies the unique best cut property for the clustering problem $S$ with target clustering
$(C_1, C_2)$.

Define $\epsilon$ such that any clustering which is $\epsilon$-close to the correct clustering $(C_1, C_2)$ must be at least
$\alpha$-close in terms of the max-cut objective. E.g., $\epsilon < \frac{1 - \alpha}{4}$ suffices because the graph $G$ is cubic. Now,
suppose a polynomial time algorithm produced a polynomial-sized list of clusterings with the guarantee
that at least one clustering in the list has error at most $\epsilon$ in terms of its accuracy with respect to $(C_1, C_2)$.
In this case, we could then just evaluate the cut value for all the clusterings in the list and pick the best
one. Since at least one clustering is $\epsilon$-close to $(C_1, C_2)$ by assumption, we are guaranteed that at
least one is within $\alpha$ of the optimum cut value. ■

Note that we can get similar results for any clustering objective $F$ that (a) is NP-hard to approximate
within a constant factor, and (b) has the smoothness property that it gives approximately the same value
to any two clusterings that are almost the same.


4.8.2  Other  interesting  properties 

An interesting relaxation of the average attraction property is to ask that there exists a cluster so that most
of the points are noticeably more similar on average to other points in their own cluster than to points
in all the other clusters, and that once we take out the points in that cluster the property becomes true
recursively. Formally:

Property 11 A similarity function $K$ satisfies the $\gamma$-weak average attraction property for the clustering
problem $(S, \ell)$ if there exists a cluster $C_r$ such that all examples $x \in C_r$ satisfy:

$$K(x, C(x)) \ge K(x, S \setminus C_r) + \gamma,$$

and moreover the same holds recursively on the set $S \setminus C_r$.

We can then adapt Algorithm 2 to get the following result:

Theorem  4.8.2  Let  K  be  a  similarity  function  satisfying  j-weak  average  attraction  for  the  clustering 

4fc  I  i  Sfe) 

problem  (5, 1).  Using  Algorithm  2  with  s(e,  7,  fc)  =  In  (^)  and  N{e,  7,  k)  =  (^)  ^  ^  we 

can  produce  a  list  of  at  most  clusterings  such  that  with  probability  1  —  6  at  least  one 

of  them  is  e-close  to  the  ground-truth. 

Sfrong  affracfion  An  inferesfing  properfy  fhaf  falls  in  befween  (he  weak  sfabilify  properfy  and  fhe  aver¬ 
age  affracfion  properfy  is  fhe  following: 

Property 12 The similarity function $K$ satisfies the $\gamma$-strong attraction property for the clustering problem $(S, \ell)$ if for all clusters $C_r$, $C_{r'}$, $r \neq r'$ in the ground-truth, for all $A \subset C_r$ we have

$$K(A, C_r \setminus A) > K(A, C_{r'}) + \gamma.$$


We can interpret the strong attraction property as saying that for any two clusters $C_r$ and $C_{r'}$ in the ground truth, for any subset $A \subset C_r$, the subset $A$ is more attracted to the rest of its own cluster than to $C_{r'}$. It is easy to see that we cannot cluster in the tree model, and moreover we can show a lower bound on the sample complexity which is exponential. Specifically:

Theorem 4.8.3 For $\epsilon < \gamma/4$, the $\gamma$-strong attraction property has $(\epsilon, 2)$ clustering complexity as large as $2^{\Omega(1/\gamma)}$.

^Thanks to Sanjoy Dasgupta for pointing out that this property is satisfied on real datasets, such as the MNIST dataset.




Proof: Consider $1/\gamma$ blobs of equal probability mass. Let's consider a special matching of these blobs $\{(R_1, L_1), (R_2, L_2), \ldots, (R_{1/(2\gamma)}, L_{1/(2\gamma)})\}$ and let's define $K(x, x') = 0$ if $x \in R_i$ and $x' \in L_i$ for some $i$ and $K(x, x') = 1$ otherwise. Then each partition of these blobs into two pieces of equal size that fully "respects" our matching (in the sense that for all $i$, $R_i$ and $L_i$ are on two different parts) satisfies Property 12 with a gap $\gamma' = 2\gamma$. The desired result then follows from the fact that the number of such partitions (which split the set of blobs into two pieces of equal "size" and fully respect our matching) is $2^{\frac{1}{2\gamma} - 1}$. ■
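As a quick check of the two quantities used in this proof, the following worked computation may help. It is only a sketch under stated assumptions: that there are $1/\gamma$ blobs of mass $\gamma$ each (so each side of a matching-respecting partition has mass $1/2$), and that $K(A, B)$ denotes the average similarity between points of $A$ and points of $B$.

```latex
\begin{align*}
% Matched blobs lie on opposite sides, so all same-side pairs have similarity 1.
K(A, C_r \setminus A) &= 1, \\
% For x in R_i, only the points of L_i (mass \gamma out of mass 1/2) have similarity 0.
K(A, C_{r'}) &= 1 - \frac{\gamma}{1/2} = 1 - 2\gamma, \\
% Hence the gap is 2\gamma. Each of the 1/(2\gamma) matched pairs picks a side
% independently, and swapping the two sides gives the same partition.
\#\{\text{partitions}\} &= \frac{2^{1/(2\gamma)}}{2} = 2^{\frac{1}{2\gamma} - 1}.
\end{align*}
```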

It would be interesting to see if one could develop algorithms especially designed for this property that provide better guarantees than Algorithm 2.

4.8.3  Verification 

A natural question is how hard it is (computationally) to determine whether a proposed clustering of a given dataset $S$ satisfies a given property. It is important to note, however, that we can always compute in polynomial time the distance between two clusterings (via a weighted matching algorithm). This then ensures that the user is able to compare in polynomial time the target/built-in clustering with any proposed clustering. So, even if it is computationally difficult to determine whether a proposed clustering of a given dataset $S$ satisfies a certain property, the property is still reasonable to consider. Note that computing the distance between the target clustering and any other clustering is the analogue of computing the empirical error rate of a given hypothesis in the PAC setting [201]; furthermore, there are many learning problems in the PAC model where the consistency problem is NP-hard (e.g. 3-Term DNF), even though the corresponding classes are learnable.
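To make the comparison step concrete, here is a minimal sketch of the distance between two clusterings. It assumes the distance is the fraction of points on which the two clusterings disagree under the best relabeling of clusters. The function name `clustering_distance` is hypothetical, and the search over permutations is a brute-force stand-in for the weighted matching step: it is exponential in $k$, whereas a maximum-weight bipartite matching routine on the same overlap table would give the polynomial-time version.

```python
from itertools import permutations


def clustering_distance(labels_a, labels_b, k):
    """Fraction of points on which two k-clusterings disagree,
    minimized over all relabelings of the second clustering."""
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same points")
    n = len(labels_a)
    # overlap[i][j] = number of points in cluster i of A and cluster j of B
    overlap = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        overlap[a][b] += 1
    # Brute-force stand-in for maximum-weight bipartite matching.
    best_agreement = max(
        sum(overlap[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return (n - best_agreement) / n


target = [0, 0, 0, 1, 1, 2]
proposed = [2, 2, 0, 0, 0, 1]
print(clustering_distance(target, proposed, k=3))
```

In this toy instance the best relabeling matches five of the six points, so only one point is counted as an error.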

4.8.4  Examples 

In all the examples below we consider symmetric similarity functions.


Strict separation and Spectral partitioning Figure 4.2 shows that it is possible for a similarity function to satisfy the strict separation property for a given clustering problem for which Theorem 4.3.2 gives a good algorithm, but nonetheless to fool a straightforward spectral clustering approach.

Consider $2k$ blobs $B_1, B_2, \ldots, B_k, B'_1, B'_2, \ldots, B'_k$ of equal probability mass. Assume that $K(x, x') = 1$ if $x \in B_i$ and $x' \in B'_i$, and $K(x, x') = 1$ if $x, x' \in B_i$ or $x, x' \in B'_i$, for all $i \in \{1, \ldots, k\}$. Assume also $K(x, x') = 0.5$ if $x \in B_i$ and $x' \in B_j$ or $x \in B'_i$ and $x' \in B'_j$, for $i \neq j$; let $K(x, x') = 0$ otherwise. See Figure 4.2 (a). Let $C_i = B_i \cup B'_i$, for all $i \in \{1, \ldots, k\}$. It is easy to verify that the clustering $C_1, \ldots, C_k$ (see Figure 4.2 (b)) is consistent with Property 4.2 (a possible value for the unknown threshold is $c = 0.7$). However, for $k$ large enough the cut of min-conductance is the one shown in Figure 4.2 (c), namely the cut that splits the graph into parts $\{B_1, B_2, \ldots, B_k\}$ and $\{B'_1, B'_2, \ldots, B'_k\}$. A direct consequence of this example is that applying a spectral clustering approach could lead to a hypothesis of high error.
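The construction above can be checked numerically on a small instance. The sketch below is illustrative only and makes two simplifying assumptions that are not in the text: each blob is collapsed to a single graph node (so intra-blob edges are ignored), and conductance is taken as cut weight divided by the smaller of the two side volumes. All function names are hypothetical. It brute-forces every cut for $k = 5$ and checks strict separation at the level of blobs.

```python
from itertools import combinations


def blob_similarity(k):
    """Nodes 0..k-1 stand for B_1..B_k, nodes k..2k-1 for B'_1..B'_k."""
    n = 2 * k
    K = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v or u % k == v % k:
                K[u][v] = 1.0      # same blob, or B_i with B'_i
            elif (u < k) == (v < k):
                K[u][v] = 0.5      # same side, different index
    return K


def satisfies_strict_separation(K, k):
    """Every node is more similar to its own cluster C_i than to any other node."""
    n = len(K)
    for u in range(n):
        own = [K[u][v] for v in range(n) if v % k == u % k and v != u]
        other = [K[u][v] for v in range(n) if v % k != u % k]
        if min(own) <= max(other):
            return False
    return True


def conductance(K, side):
    n = len(K)
    inside = set(side)
    outside = set(range(n)) - inside
    cut = sum(K[u][v] for u in inside for v in outside)

    def volume(nodes):
        return sum(K[u][v] for u in nodes for v in range(n) if v != u)

    return cut / min(volume(inside), volume(outside))


def min_conductance_cut(K):
    """Brute force over all cuts; node 0 is fixed on one side to avoid duplicates."""
    n = len(K)
    best = None
    for size in range(1, n):
        for rest in combinations(range(1, n), size - 1):
            side = (0,) + rest
            phi = conductance(K, side)
            if best is None or phi < best[0]:
                best = (phi, side)
    return best


K = blob_similarity(5)
print(satisfies_strict_separation(K, 5))
print(min_conductance_cut(K))
```

Under these assumptions the minimizing side is the set of unprimed blobs, which cuts through every cluster $C_i$, even though strict separation holds.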


Linkage-based algorithms and strong stability Figure 4.3 (a) gives an example of a similarity function that does not satisfy the strict separation property, but for large enough $m$, w.h.p. will satisfy the strong stability property. (This is because there are at most $m^k$ subsets $A$ of size $k$, and each one has failure probability only $e^{-O(mk)}$.) However, single-linkage using $K_{\max}(C, C')$ would still work well here.

Figure 4.3 (b) extends this to an example where single-linkage using $K_{\max}(C, C')$ fails. Figure 4.3 (c) gives an example where strong stability is not satisfied and average linkage would fail too. However, notice that the average attraction property is satisfied and Algorithm 2 will succeed.
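The failure of single-linkage on the example of Figure 4.3 (b) can be reproduced on a small instance. The sketch below is a toy illustration, not the thesis's algorithm: it assumes within-blob similarity 1, takes the intended clustering to be $C_1 = B_1 \cup B_2$ and $C_2 = B_3 \cup B_4$, picks the first point of each blob as the "special" point, and uses a naive agglomerative loop. All function names are hypothetical.

```python
def build_similarity(m):
    """Similarity for four blobs of m points each, laid out as consecutive index blocks."""
    n = 4 * m
    blob = [p // m for p in range(n)]
    between = {(0, 1): 0.85, (2, 3): 0.85, (0, 2): 0.5, (1, 3): 0.5,
               (0, 3): 0.0, (1, 2): 0.0}
    K = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            a, b = sorted((blob[p], blob[q]))
            K[p][q] = 1.0 if a == b else between[(a, b)]
    # Special pairs: (x_1 in B_1, x_3 in B_3) and (x_2 in B_2, x_4 in B_4).
    for p, q in ((0, 2 * m), (m, 3 * m)):
        K[p][q] = K[q][p] = 0.9
    return K


def linkage(K, score, target_clusters=2):
    """Repeatedly merge the pair of clusters whose cross-similarities score highest."""
    clusters = [[p] for p in range(len(K))]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [K[p][q] for p in clusters[i] for q in clusters[j]]
                s = score(sims)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def average(sims):
    return sum(sims) / len(sims)


K = build_similarity(3)
print(linkage(K, max))      # single-linkage with K_max
print(linkage(K, average))  # average linkage
```

With the max score, the two 0.9 pairs pull $B_1$ toward $B_3$ and $B_2$ toward $B_4$; with the average score those two pairs are outweighed by the 0.85 block similarities.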




Figure 4.2: Consider $2k$ blobs $B_1, B_2, \ldots, B_k, B'_1, \ldots, B'_k$ of equal probability mass. Points inside the same blob have similarity 1. Assume that $K(x, x') = 1$ if $x \in B_i$ and $x' \in B'_i$. Assume also $K(x, x') = 0.5$ if $x \in B_i$ and $x' \in B_j$ or $x \in B'_i$ and $x' \in B'_j$, for $i \neq j$; let $K(x, x') = 0$ otherwise. Let $C_i = B_i \cup B'_i$, for all $i \in \{1, \ldots, k\}$. It is easy to verify that the clustering $C_1, \ldots, C_k$ is consistent with Property 1 (part (b)). However, for $k$ large enough the cut of min-conductance is the cut that splits the graph into parts $\{B_1, B_2, \ldots, B_k\}$ and $\{B'_1, B'_2, \ldots, B'_k\}$ (part (c)).


4.9  Conclusions  and  Discussion 


In  this  chapter  we  provide  a  generic  framework  for  analyzing  what  properties  of  a  similarity  function  are 
sufficient  to  allow  it  to  be  useful  for  clustering,  under  two  natural  relaxations  of  the  clustering  objective. 
We  propose  a  measure  of  the  clustering  complexity  of  a  given  property  that  characterizes  its  information- 
theoretic  usefulness  for  clustering,  and  analyze  this  complexity  for  a  broad  class  of  properties,  as  well  as 
develop  efficient  algorithms  that  are  able  to  take  advantage  of  them. 

Our  work  can  be  viewed  both  in  terms  of  providing  formal  advice  to  the  designer  of  a  similarity 
function  for  a  given  clustering  task  (such  as  clustering  query  search  results)  and  in  terms  of  advice  about 
what  algorithms  to  use  given  certain  beliefs  about  the  relation  of  the  similarity  function  to  the  clustering 
task.  Our  model  also  provides  a  better  understanding  of  when  (in  terms  of  the  relation  between  the 
similarity  measure  and  the  ground-truth  clustering)  different  hierarchical  linkage-based  algorithms  will 
fare  better  than  others.  Abstractly  speaking,  our  notion  of  a  property  parallels  that  of  a  data-dependent 
concept class [203] (such as large-margin separators) in the context of classification.

Open questions: Broadly, one would like to analyze other natural properties of similarity functions, as well as to further explore and formalize other models of interactive feedback. In terms of specific open questions, for the average attraction property (Property 3) we have an algorithm that for $k = 2$ produces a list of size approximately $2^{O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})}$ and a lower bound on clustering complexity of $2^{\Omega(1/\gamma)}$. One natural open question is whether one can close that gap. A second open question is that for the strong stability of large subsets property (Property 7), our algorithm produces a hierarchy but has running time substantially larger than that for the simpler stability properties. Can an algorithm with running time polynomial in $k$ and $1/\gamma$ be developed? Can one prove stability properties for clustering based on spectral methods, e.g., the hierarchical clustering algorithm given in [85]? More generally, it would be interesting to determine whether these stability properties can be further weakened and still admit a hierarchical clustering. Finally, in this work we have focused on formalizing clustering with non-interactive feedback. It would be interesting to formalize clustering with other natural forms of feedback.




Figure 4.3: Part (a): Consider two blobs $B_1$, $B_2$ with $m$ points each. Assume that $K(x, x') = 0.3$ if $x \in B_1$ and $x' \in B_2$, and $K(x, x')$ is random in $\{0, 1\}$ if $x, x' \in B_i$ for all $i$. Clustering $C_1, C_2$ does not satisfy Property 1, but for large enough $m$, w.h.p. will satisfy Property 5. Part (b): Consider four blobs $B_1, B_2, B_3, B_4$ of $m$ points each. Assume $K(x, x') = 1$ if $x, x' \in B_i$, for all $i$, $K(x, x') = 0.85$ if $x \in B_1$ and $x' \in B_2$, $K(x, x') = 0.85$ if $x \in B_3$ and $x' \in B_4$, $K(x, x') = 0$ if $x \in B_1$ and $x' \in B_4$, $K(x, x') = 0$ if $x \in B_2$ and $x' \in B_3$. Now $K(x, x') = 0.5$ for all points $x \in B_1$ and $x' \in B_3$, except for two special points $x_1 \in B_1$ and $x_3 \in B_3$ for which $K(x_1, x_3) = 0.9$. Similarly $K(x, x') = 0.5$ for all points $x \in B_2$ and $x' \in B_4$, except for two special points $x_2 \in B_2$ and $x_4 \in B_4$ for which $K(x_2, x_4) = 0.9$. For large enough $m$, clustering $C_1, C_2$ satisfies Property 5. Part (c): Consider two blobs $B_1$, $B_2$ of $m$ points each, with similarities within a blob all equal to 0.7, and similarities between blobs chosen uniformly at random from $\{0, 1\}$.


Algorithm 6 Sampling Based Algorithm, List Model

Input: Data set $S$, similarity function $K$, parameters $\gamma, \epsilon > 0$, $k \in Z^+$; $d_1(\epsilon, \gamma, k, \delta)$, $d_2(\epsilon, \gamma, k, \delta)$.

• Set $\mathcal{L} = \emptyset$.

• Pick a set $U = \{x_1, \ldots, x_{d_1}\}$ of $d_1$ random examples from $S$, where $d_1 = d_1(\epsilon, \gamma, k, \delta)$. Use $U$ to define the mapping $\rho_U : X \to \mathbb{R}^{d_1}$, $\rho_U(x) = (K(x, x_1), K(x, x_2), \ldots, K(x, x_{d_1}))$.

• Pick a set $\tilde{U}$ of $d_2 = d_2(\epsilon, \gamma, k, \delta)$ random examples from $S$ and consider the induced set $\rho_U(\tilde{U})$.

• Consider all the $(k + 1)^{d_2}$ possible labellings of the set $\rho_U(\tilde{U})$ where the $(k + 1)$st label is used to throw out points in the $\nu$ fraction that do not satisfy the property. For each labelling use the Winnow algorithm [164, 212] to learn a multiclass linear separator $h$ and add the clustering induced by $h$ to $\mathcal{L}$.

• Output the list $\mathcal{L}$.
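The control flow of Algorithm 6 can be sketched in a few lines for the case $k = 2$. This is an illustration under loudly stated assumptions, not the algorithm as analyzed: the landmark set and the second sample are passed in explicitly rather than drawn at random, a plain perceptron is swapped in for Winnow, and all names (`similarity_map`, `perceptron`, `list_clusterings`) are hypothetical.

```python
from itertools import product


def similarity_map(K, landmarks):
    """rho_U: x -> (K(x, x_1), ..., K(x, x_{d_1})) for the landmark set U."""
    return lambda x: [K(x, u) for u in landmarks]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron(points, labels, epochs=100):
    """Stand-in linear learner (the text uses Winnow here)."""
    w = [0.0] * len(points[0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def list_clusterings(S, K, landmarks, sample):
    """Try every labelling of the sample (None = thrown out) and collect
    the clustering of S induced by each learned separator."""
    rho = similarity_map(K, landmarks)
    mapped_sample = [rho(x) for x in sample]
    mapped_S = [rho(x) for x in S]
    clusterings = []
    for labelling in product((+1, -1, None), repeat=len(sample)):
        kept = [(p, y) for p, y in zip(mapped_sample, labelling) if y is not None]
        if kept:
            w = perceptron([p for p, _ in kept], [y for _, y in kept])
        else:
            w = [0.0] * len(landmarks)
        clusterings.append(tuple(1 if dot(w, p) >= 0 else -1 for p in mapped_S))
    return clusterings


S = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
K = lambda x, y: 1 - 2 * abs(x - y)   # toy similarity with values in [-1, 1]
result = list_clusterings(S, K, landmarks=[0.0, 1.0], sample=[0.1, 0.9])
print(len(result), (1, 1, 1, -1, -1, -1) in result)
```

The list has one entry per labelling, so its length grows as $(k + 1)^{d_2}$; the point of the analysis is that $d_2$ can be kept small.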


4.10  Other  Proofs 

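Theorem 4.4.4 Let $K$ be a similarity function satisfying the $(\nu, \gamma)$-average weighted attraction property for the clustering problem $(S, \ell)$. Using Algorithm 6 with parameters $d_1 = O\left(\frac{1}{\epsilon}\left(\frac{1}{\gamma^2} + 1\right)\ln\frac{1}{\delta}\right)$ and $d_2 = O\left(\frac{1}{\epsilon}\left(\frac{1}{\gamma^2}\ln d_1 + \ln\frac{1}{\delta}\right)\right)$ we can produce a list of at most $k^{\tilde{O}(k/(\epsilon\gamma^2))}$ clusterings such that with probability $1 - \delta$ at least one of them is $(\epsilon + \nu)$-close to the ground-truth.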

Proof: 

For simplicity we describe the case $k = 2$. The generalization to larger $k$ follows the standard multiclass to binary reduction [203].

For convenience let us assume that the labels of the two clusters are $\{-1, +1\}$ and without loss of generality assume that each of the two clusters has at least an $\epsilon$ probability mass. Let $U$ be a random sample from $S$ of $d_1 = \frac{1}{\epsilon}\left((4/\gamma)^2 + 1\right)\ln(4/\delta)$ points. We show first that with probability at least $1 - \delta$, the mapping $\rho_U : X \to \mathbb{R}^{d_1}$ defined as

$$\rho_U(x) = (K(x, x_1), K(x, x_2), \ldots, K(x, x_{d_1}))$$

has the property that the induced distribution $\rho_U(S)$ in $\mathbb{R}^{d_1}$ has a separator of error at most $\delta$ (of the $1 - \nu$ fraction of the distribution satisfying the property) at $L_1$ margin at least $\gamma/4$.

First notice that $d_1$ is large enough so that with high probability our sample contains at least $d = (4/\gamma)^2 \ln(4/\delta)$ points in each cluster. Let $U^+$ be the subset of $U$ consisting of the first $d$ points of true label $+1$, and let $U^-$ be the subset of $U$ consisting of the first $d$ points of true label $-1$. Consider the linear separator $\beta$ in the $\rho_U$ space defined as $\beta_i = \ell(x_i) w(x_i)$ for $x_i \in U^- \cup U^+$ and $\beta_i = 0$ otherwise. We show that, with probability at least $(1 - \delta)$, $\beta$ has error at most $\delta$ at $L_1$ margin $\gamma/4$. Consider some fixed point $x \in S$. We begin by showing that for any such $x$,

$$\Pr_U\left[\ell(x)\,\beta \cdot \rho_U(x) \ge d\,\frac{\gamma}{2}\right] \ge 1 - \delta^2.$$

To do so, first notice that $d$ is large enough so that with high probability, at least $1 - \delta^2$, we have both:

$$\left|\mathbf{E}_{x' \in U^+}[w(x')K(x, x')] - \mathbf{E}_{x' \sim S}[w(x')K(x, x') \mid \ell(x') = 1]\right| \le \frac{\gamma}{4}$$

and

$$\left|\mathbf{E}_{x' \in U^-}[w(x')K(x, x')] - \mathbf{E}_{x' \sim S}[w(x')K(x, x') \mid \ell(x') = -1]\right| \le \frac{\gamma}{4}.$$

Let's consider now the case when $\ell(x) = 1$. In this case we have

$$\ell(x)\,\beta \cdot \rho_U(x) = d\left(\frac{1}{d}\sum_{x_i \in U^+} w(x_i)K(x, x_i) - \frac{1}{d}\sum_{x_i \in U^-} w(x_i)K(x, x_i)\right)$$

and so combining these facts we have that with probability at least $(1 - \delta^2)$ the following holds:

$$\ell(x)\,\beta \cdot \rho_U(x) \ge d\left(\mathbf{E}_{x' \sim S}[w(x')K(x, x') \mid \ell(x') = 1] - \gamma/4 - \mathbf{E}_{x' \sim S}[w(x')K(x, x') \mid \ell(x') = -1] - \gamma/4\right).$$

Since the attraction property guarantees that the two conditional expectations differ by at least $\gamma$, this then implies that $\ell(x)\,\beta \cdot \rho_U(x) \ge d\gamma/2$. Finally, since $w(x') \in [-1, 1]$ for all $x'$, and since $K(x, x') \in [-1, 1]$ for all pairs $x, x'$, we have that $\|\beta\|_1 \le 2d$ and $\|\rho_U(x)\|_\infty \le 1$, which implies

$$\Pr_U\left[\frac{\ell(x)\,\beta \cdot \rho_U(x)}{\|\beta\|_1 \|\rho_U(x)\|_\infty} \ge \frac{\gamma}{4}\right] \ge 1 - \delta^2.$$

The same analysis applies for the case that $\ell(x) = -1$.

Lastly, since the above holds for any $x$, it is also true for random $x \in S$, which implies by Markov's inequality that with probability at least $1 - \delta$, the vector $\beta$ has error at most $\delta$ at $L_1$ margin $\gamma/4$ over $\rho_U(S)$, where examples have $L_\infty$ norm at most 1.

So, we have proved that if $K$ is a similarity function satisfying the $(0, \gamma)$-average weighted attraction property for the clustering problem $(S, \ell)$, then with high probability there exists a low-error (at most $\delta$) large-margin (at least $\gamma/4$) separator in the transformed space under mapping $\rho_U$. Thus, all we need now to cluster well is to draw a new fresh sample $\tilde{U}$, guess their labels (and which to throw out), map them into the transformed space using $\rho_U$, and then apply a good algorithm for learning linear separators in the new space that (if our guesses were correct) produces a hypothesis of error at most $\epsilon$ with probability at least $1 - \delta$. Thus we now simply need to calculate the appropriate value of $d_2$.

The appropriate value of $d_2$ can be determined as follows. Remember that the vector $\beta$ has error at most $\delta$ at $L_1$ margin $\gamma/4$ over $\rho_U(S)$, where the mapping $\rho_U$ produces examples of $L_\infty$ norm at most 1. This implies that the mistake bound of the Winnow algorithm on new labeled data (restricted to the $1 - \delta$ good fraction) is $O\left(\frac{1}{\gamma^2}\ln d_1\right)$. Setting $\delta$ to be sufficiently small such that with high probability no bad points appear in the sample, and using standard mistake bound to PAC conversions [163], this then implies that a sample of size $d_2 = O\left(\frac{1}{\epsilon}\left(\frac{1}{\gamma^2}\ln d_1 + \ln\frac{1}{\delta}\right)\right)$ is sufficient. ■




Chapter  5 

Active  Learning 


In this chapter we return to the supervised classification setting and present some of our results on Active Learning. As mentioned in Chapter 1, in the active learning model [86, 94], the learning algorithm is allowed to draw random unlabeled examples from the underlying distribution and ask for the labels of any of these examples. The hope is that a good classifier can be learned with significantly fewer labels by actively directing the queries to informative examples.

As in passive supervised learning, but unlike in semi-supervised learning (which we discussed in Chapter 2), the only prior belief about the learning problem here is that the target function (or a good approximation of it) belongs to a given concept class. For some concept classes such as thresholds on the line, one can achieve an exponential improvement over the usual sample complexity of supervised learning, under no additional assumptions about the learning problem [86, 94]. In general, the speedups achievable in active learning depend on the match between the data distribution and the hypothesis class, and therefore on the target hypothesis in the class. The most noteworthy non-trivial example of improvement is the case of homogeneous (i.e., through the origin) linear separators, when the data is linearly separable and distributed uniformly over the unit sphere [94, 98, 112]. There are also simple examples where active learning does not help at all, even in the realizable case [94]. Note that in the active learning model the goal is to reduce the dependence on $1/\epsilon$ from linear or quadratic to logarithmic, and that this is somewhat orthogonal to the goals considered in Chapter 2 where the focus was on reducing the complexity of the class of functions.

In our work, we provide several new theoretical results for Active Learning. First, we prove, for the first time, the feasibility of agnostic active learning. Specifically, we propose and analyze the first active learning algorithm that finds an $\epsilon$-optimal hypothesis in any hypothesis class, when the underlying distribution has arbitrary forms of noise. We also analyze margin based active learning of linear separators. We discuss these in Sections 5.1 and 5.2 below, and as mentioned in Section 1.2, these results are based on work appearing in [30, 33, 35]. Finally, in recent work [34, 41], we also show that in an asymptotic model for Active Learning where one bounds the number of queries the algorithm makes before it finds a good function (i.e. one of arbitrarily small error rate), but not the number of queries before it knows it has found a good function, one can obtain significantly better bounds on the number of label queries required to learn than in the traditional active learning models.


5.1  Agnostic  Active  Learning 

In this section, we provide and analyze the first active learning algorithm that finds an $\epsilon$-optimal hypothesis in any hypothesis class, when the underlying distribution has arbitrary forms of noise. The algorithm, $A^2$ (for Agnostic Active), relies only upon the assumption that it has access to a stream of unlabeled examples drawn i.i.d. from a fixed distribution. We show that $A^2$ achieves an exponential improvement (i.e., requires only $O\left(\ln\frac{1}{\epsilon}\right)$ samples to find an $\epsilon$-optimal classifier) over the usual sample complexity of supervised learning, for several settings considered before in the realizable case. These include learning threshold classifiers and learning homogeneous linear separators with respect to an input distribution which is uniform over the unit sphere.

5.1.1  Introduction 

Most of the previous work on active learning has focused on the realizable case. In fact, many of the existing active learning strategies are noise seeking on natural learning problems, because the process of actively finding an optimal separation between one class and another often involves label queries for examples close to the decision boundary, and such examples often have a large conditional noise rate (e.g., due to a mismatch between the hypothesis class and the data distribution). Thus the most informative examples are also the ones that are typically the most noise-prone.

Consider an active learning algorithm which searches for the optimal threshold on an interval using binary search. This example is often used to demonstrate the potential of active learning in the noise-free case when there is a perfect threshold separating the classes [86]. Binary search needs $O\left(\ln\frac{1}{\epsilon}\right)$ labeled examples to learn a threshold with error less than $\epsilon$, while learning passively requires $O\left(\frac{1}{\epsilon}\right)$ labels. A fundamental drawback of this algorithm is that a small amount of adversarial noise can force the algorithm to behave badly. Is this extreme brittleness to small amounts of noise essential? Can an exponential decrease in sample complexity be achieved? Can assumptions about the mechanism producing noise be avoided? These are the questions addressed here.
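The noise-free binary-search learner described above can be sketched as follows. This is a minimal illustration assuming a pool of sorted unlabeled points and a perfectly consistent labeling oracle; the names `learn_threshold` and `query_label` are hypothetical. It also counts label queries, so the gap between logarithmic and linear label usage is visible.

```python
def learn_threshold(points, query_label):
    """Binary search for the first positively labeled point.

    points: unlabeled examples sorted in ascending order.
    query_label(x): returns -1 below the true threshold and +1 at or above it.
    Returns (index of the first +1 point, number of label queries used).
    """
    lo, hi = 0, len(points)   # the first positive index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(points[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return lo, queries


pool = [i / 1000 for i in range(1000)]          # 1000 unlabeled points
oracle = lambda x: 1 if x >= 0.3715 else -1     # hypothetical noise-free target
index, used = learn_threshold(pool, oracle)
print(pool[index], used)
```

A passive learner would label all 1000 points to pin the threshold down to the same resolution. A single flipped label near the middle of the search would send this procedure to the wrong half of the interval, which is the brittleness the paragraph refers to.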

Previous Work on Active Learning There has been substantial work on active learning under additional assumptions. For example, the Query by Committee analysis [112] assumes realizability (i.e., existence of a perfect classifier in a known set), and a correct Bayesian prior on the set of hypotheses. Dasgupta [94] has identified sufficient conditions (which are also necessary against an adversarially chosen distribution) for active learning given only the additional realizability assumption. There are several other papers that assume only realizability [93, 98]. If there exists a perfect hypothesis in the concept class, then any informative querying strategy can direct the learning process without the need to worry about the distribution it induces: any inconsistent hypothesis can be eliminated based on a single query, regardless of which distribution this query comes from. In the agnostic case, however, a hypothesis that performs badly on the query distribution may well be the optimal hypothesis with respect to the input distribution. This is the main challenge in agnostic active learning that is not present in the non-agnostic case. Burnashev and Zigangirov [75] allow noise, but require a correct Bayesian prior on threshold functions. Some papers require specific noise models such as a constant noise rate everywhere [79] or Tsybakov noise conditions [33, 78]. (In fact, in Section 5.2 we discuss active learning of linear separators under a certain type of noise related to the Tsybakov noise conditions [33, 78].)

The membership-query setting [16, 17, 74, 137] is similar to active learning considered here, except that no unlabeled data is given. Instead, the learning algorithm is allowed to query examples of its own choice. This is problematic in several applications because natural oracles, such as hired humans, have difficulty labeling synthetic examples [46]. Ulam's Problem (quoted in [90]), where the goal is to find a distinguished element in a set by asking subset membership queries, is also related. The quantity of interest is the smallest number of such queries required to find the element, given a bound on the number of queries that can be answered incorrectly. But neither type of result applies here, since an active learning strategy can only buy labels of the examples it observes. For example, a membership query algorithm can be used to quickly home in on a separating hyperplane in a high-dimensional space. An active learning algorithm cannot do so when the data distribution does not support queries close to the decision boundary.^1

Our Contributions We present here the first agnostic active learning algorithm, $A^2$. The only necessary assumption is that the algorithm has access to a stream of examples drawn i.i.d. from some fixed distribution. No additional assumptions are made about the mechanism producing noise (e.g., class/target misfit, fundamental randomization, adversarial situations). The main contribution of our work is to prove the feasibility of agnostic active learning.

Two  comments  are  in  order: 

1. We define the noise rate of a hypothesis class $C$ with respect to a fixed distribution $D$ as the minimum error rate of any hypothesis in $C$ on $D$ (see Section 5.1.2 for a formal definition). Note that for the special case of so-called label noise (where a coin of constant bias is used to determine whether any particular example is mislabeled with respect to the best hypothesis) these definitions coincide.

2. We regard unlabeled data as being of minimal cost so as to focus exclusively on the question of whether or not agnostic active learning is possible at all. Substantial follow-up to the original publication of our work [30] has successfully optimized unlabeled data usage to be on the same order as passive learning [99].^2

$A^2$ is provably correct (for any $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, it outputs an $\epsilon$-optimal hypothesis with probability at least $1 - \delta$) and it is never harmful (it never requires significantly more labeled examples than batch learning). $A^2$ provides exponential sample complexity reductions in several settings previously analyzed without noise or with known noise conditions. This includes learning threshold functions with small noise with respect to $\epsilon$ and hypothesis classes consisting of homogeneous (through the origin) linear separators with the data distributed uniformly over the unit sphere in $\mathbb{R}^d$. The last example has been the most encouraging theoretical result so far in the realizable case [98].

The analysis achieves an almost contradictory property: for some sets of classifiers, an $\epsilon$-optimal classifier can be output with fewer labeled examples than are needed to estimate the error rate of the chosen classifier with precision $\epsilon$ from random examples only.

Lower Bounds It is important to keep in mind that the speedups achievable with active learning depend on the match between the distribution over example-label pairs and the hypothesis class, and therefore on the target hypothesis in the class. Thus one should expect the results to be distribution-dependent. There are simple examples where active learning does not help at all in the model analyzed in this section, even if there is no noise [94]. These lower bounds essentially result from an "aliasing" effect and they are unavoidable in the setting we analyze in this section (where we bound the number of queries an algorithm makes before it can prove it has found a good function).^3

In the noisy situation, the target function itself can be very simple (e.g., a threshold function), but if the error rate is very close to 1/2 in a sizeable interval near the threshold, then no active learning procedure can significantly outperform passive learning. In particular, in the pure agnostic setting one cannot hope to achieve speedups when the noise rate $\nu$ is large, due to a lower bound of $\Omega\left(\frac{\nu^2}{\epsilon^2}\right)$ on the sample complexity of any active learner [143]. However, under specific noise models (such as a constant noise rate everywhere [79] or Tsybakov noise conditions [33, 78]) and for specific classes, one can still show significant improvement over supervised learning.

^1 Note also that much of the work on using membership queries [16, 17, 74, 137] has been focused on problems where it was not possible to get a polynomial time learning algorithm in the passive learning setting (in a PAC sense), with the hope that the membership queries will allow learning in polynomial time. In contrast, much of the work in the Active Learning literature has been focused on reducing the sample complexity.

^2 One can show we might end up using a factor of $1/\epsilon$ more unlabeled examples than the number of labeled examples one would normally need in a passive learning setting.

^3 In recent work [34, 41], we have shown that in an asymptotic model for Active Learning where one bounds the number of queries the algorithm makes before it finds a good function (i.e. one of arbitrarily small error rate), but not the number of queries before it can prove (or knows) it has found a good function, one can obtain significantly better bounds on the number of label queries required to learn.

Structure of this section Preliminaries and notation are covered in Section 5.1.2. $A^2$ is presented in Section 5.1.3; Section 5.1.3 also proves that $A^2$ is correct and that it is never harmful (i.e., it never requires significantly more samples than batch learning). Threshold functions such as $f_t(x) = \mathrm{sign}(x - t)$ and homogeneous linear separators under the uniform distribution over the unit sphere are analyzed in Section 5.1.4. Conclusions, a discussion of subsequent work, and open questions are covered in Section 5.1.6.

5.1.2  Preliminaries 

We consider a binary agnostic learning problem specified as follows. Let $X$ be an instance space and $Y = \{-1, 1\}$ be the set of possible labels. Let $C$ be the hypothesis class, a set of functions mapping from $X$ to $Y$. We assume there is a distribution $D$ over instances in $X$, and that the instances are labeled by a possibly randomized oracle $O$ (i.e., the target function). The oracle $O$ can be thought of as taking an unlabeled example $x$ in, choosing a biased coin based on $x$, then flipping it to find the label $-1$ or $1$. We let $P$ denote the induced distribution over $X \times Y$. The error rate of a hypothesis $h$ with respect to a distribution $P$ over $X \times Y$ is defined as $\mathrm{err}_P(h) = \Pr_{(x,y) \sim P}[h(x) \neq y]$. The error rate $\mathrm{err}_P(h)$ is not generally known since $P$ is unknown; however, the empirical version $\widehat{\mathrm{err}}_P(h) = \Pr_{(x,y) \sim S}[h(x) \neq y] = \frac{1}{|S|} \sum_{(x,y) \in S} I(h(x) \neq y)$ is computable based upon an observed sample set $S$ drawn from $P$.

Let $\nu = \min_{h \in C} \mathrm{err}_{D,O}(h)$ denote the minimum error rate of any hypothesis in $C$ with respect to the distribution $(D, O)$ induced by $D$ and the labeling oracle $O$. The goal is to find an $\epsilon$-optimal hypothesis, i.e., a hypothesis $h \in C$ with $\mathrm{err}_{D,O}(h)$ within $\epsilon$ of $\nu$, where $\epsilon$ is some target error.
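As a small illustration of the empirical error rate defined above, the following sketch computes it for threshold hypotheses on a toy labeled sample. The function names and the sample are illustrative assumptions, not part of the thesis.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, +1}


def threshold(t: float) -> Callable[[float], int]:
    """Threshold hypothesis f_t(x) = sign(x - t), taking sign(0) = +1."""
    return lambda x: 1 if x >= t else -1


def empirical_error(h: Callable[[float], int], sample: Sequence[Example]) -> float:
    """Fraction of labeled examples in `sample` that h misclassifies."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


# Toy sample: the last label is "noisy" with respect to the threshold 0.5.
toy_sample = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, -1)]
print(empirical_error(threshold(0.5), toy_sample))
```

In the agnostic setting no hypothesis need reach zero error on such a sample, which is why the goal is stated relative to the best achievable error $\nu$.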

The $A^2$ algorithm relies on a subroutine, which computes a lower bound $LB(S, h, \delta)$ and an upper bound $UB(S, h, \delta)$ on the true error rate $\mathrm{err}_P(h)$ of $h$ by using a sample $S$ of examples drawn i.i.d. from $P$. Each of these bounds must hold for all $h$ simultaneously with probability at least $1 - \delta$. The subroutine is formally defined below.

Definition 5.1.1 A subroutine for computing $LB(S, h, \delta)$ and $UB(S, h, \delta)$ is said to be legal if for all distributions $P$ over $X \times Y$, for all $0 < \delta < 1/2$ and $m \in \mathbb{N}$,

$LB(S, h, \delta) \le \mathrm{err}_P(h) \le UB(S, h, \delta)$

holds for all $h \in C$ simultaneously, with probability $1 - \delta$ over the draw of $S$ according to $P^m$.

Classic examples of such subroutines are the (distribution independent) VC bound [202] and the Occam Razor bound [68], or the newer data dependent generalization bounds such as those based on Rademacher Complexities [72]. For concreteness, we could use the VC bound subroutine stated in Appendix A.1.1.
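To make the notion of a legal subroutine concrete, here is a minimal sketch for the special case of a finite hypothesis class, using Hoeffding's inequality with a union bound. This is an assumption made for illustration; it is not the VC bound of Appendix A.1.1, and the function name is hypothetical.

```python
import math
from typing import Tuple


def hoeffding_bounds(emp_err: float, m: int, class_size: int, delta: float) -> Tuple[float, float]:
    """Return (LB, UB) for one hypothesis with empirical error `emp_err` on m samples.

    With width = sqrt(ln(2 |C| / delta) / (2 m)), Hoeffding's inequality plus a
    union bound over a finite class C gives |emp_err - err| <= width for all
    h in C simultaneously with probability at least 1 - delta.
    """
    if m == 0:
        return 0.0, 1.0  # no data: only the trivial bounds are available
    width = math.sqrt(math.log(2 * class_size / delta) / (2 * m))
    return max(0.0, emp_err - width), min(1.0, emp_err + width)


lb, ub = hoeffding_bounds(emp_err=0.2, m=1000, class_size=50, delta=0.05)
print(lb, ub)
```

The width shrinks as $1/\sqrt{m}$, so halving the gap between UB and LB requires roughly four times as many labeled examples.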

As we will see in the following section, a key point in the algorithm we present is that we will not have to bring the range between the lower and upper bounds close to $\epsilon$ (the desired target accuracy); it will be enough for this range to be of constant width on a series of carefully chosen distributions over $X \times Y$.

5.1.3  The  Agnostic  Active  Learner 

At a high level, $A^2$ can be viewed as a robust version of the selective sampling algorithm of [86]. Selective sampling is a sequential process that keeps track of two spaces: the current version space $C_i$, defined as the set of hypotheses in $C$ consistent with all labels revealed so far, and the current region of uncertainty $R_i$, defined as the set of all $x \in X$ for which there exists a pair of hypotheses in $C_i$ that disagrees on $x$. In round $i$, the algorithm picks a random unlabeled example from $R_i$ and queries it, eliminating all hypotheses in $C_i$ inconsistent with the received label. The algorithm then eliminates those $x \in R_i$ on which all surviving hypotheses agree, and recurses. This process fundamentally relies on the assumption that there exists a consistent hypothesis in $C$. In the agnostic case, a hypothesis cannot be eliminated based on its disagreement with a single example. Any algorithm must be more conservative in order to avoid risking eliminating the best hypotheses in the class.

A formal specification of $A^2$ is given in Algorithm 7. Let $C_i$ be the set of hypotheses still under consideration by $A^2$ in round $i$. If all hypotheses in $C_i$ agree on some region of the instance space, this region can be safely eliminated. To help us keep track of progress in decreasing the region of uncertainty, define $\mathrm{DISAGREE}_D(C_i)$ as the probability that there exists a pair of hypotheses in $C_i$ that disagrees on a random example drawn from $D$:

$\mathrm{DISAGREE}_D(C_i) = \Pr_{x \sim D}[\exists h_1, h_2 \in C_i : h_1(x) \neq h_2(x)]$.

Hence $\mathrm{DISAGREE}_D(C_i)$ is the volume of the current region of uncertainty with respect to $D$.

Clearly, the ability to sample from the unlabeled data distribution $D$ implies the ability to compute $\mathrm{DISAGREE}_D(C_i)$. To see this, note that $\mathrm{DISAGREE}_D(C_i) = E_{x \sim D}\, I(\exists h_1, h_2 \in C_i : h_1(x) \neq h_2(x))$ is an expectation over unlabeled points drawn from $D$. Consequently, Chernoff bounds on the empirical expectation of a $\{0, 1\}$ random variable imply that $\mathrm{DISAGREE}_D(C_i)$ can be estimated to any desired precision with any desired confidence using a sufficiently large unlabeled dataset.
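The estimation argument above can be sketched for thresholds, where a version space consisting of all thresholds in $[lo, hi]$ has region of uncertainty $[lo, hi)$. The sketch below assumes $D$ is uniform on $[0, 1)$ and uses the two-sided Hoeffding bound to size the unlabeled sample; all names are illustrative.

```python
import math
import random
from typing import Callable


def unlabeled_sample_size(precision: float, delta: float) -> int:
    """Samples sufficient for |estimate - truth| <= precision w.p. >= 1 - delta (Hoeffding)."""
    return math.ceil(math.log(2 / delta) / (2 * precision ** 2))


def estimate_disagree(lo: float, hi: float, draw_x: Callable[[], float], n: int) -> float:
    """Monte Carlo estimate of DISAGREE_D for the thresholds in [lo, hi].

    Two thresholds in [lo, hi] can disagree on x exactly when lo <= x < hi.
    """
    hits = sum(1 for _ in range(n) if lo <= draw_x() < hi)
    return hits / n


rng = random.Random(0)
n = unlabeled_sample_size(precision=0.01, delta=0.05)
print(n, estimate_disagree(0.25, 0.75, rng.random, n))
```

No labels are consumed by this estimate, which is why the analysis treats $\mathrm{DISAGREE}_D(C_i)$ as a known quantity.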

Let $D_i$ be the distribution $D$ restricted to the current region of uncertainty. Formally, $D_i = D(x \mid \exists h_1, h_2 \in C_i : h_1(x) \neq h_2(x))$. In round $i$, $A^2$ samples a fresh set of examples $S$ from $(D_i, O)$, and uses it to compute upper and lower bounds for all hypotheses in $C_i$. It then eliminates all hypotheses whose lower bound is greater than the minimum upper bound.

Since $A^2$ does not label examples on which the surviving hypotheses agree, an optimal hypothesis in $C_i$ with respect to $D_i$ remains an optimal hypothesis in $C_{i+1}$ with respect to $D_{i+1}$. Since each round $i$ cuts $\mathrm{DISAGREE}_D(C_i)$ down by half, the number of rounds is bounded by $\log(1/\epsilon)$. Section 5.1.4 gives examples of distributions and hypothesis classes for which $A^2$ requires only a small number of labeled examples to transition between rounds, yielding an exponential improvement in sample complexity.

When evaluating bounds during the course of Algorithm 7, $A^2$ uses a schedule of $\delta$ according to the following rule: the $k$-th bound evaluation has confidence $\delta_k = \frac{\delta}{k(k+1)}$ for $k \ge 1$. In Algorithm 7, $k$ keeps track of the number of bound computations and $i$ of the number of rounds.
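The schedule $\delta_k = \delta / (k(k+1))$ is chosen so that the failure probabilities of all bound evaluations sum to at most $\delta$: since $\frac{1}{k(k+1)} = \frac{1}{k} - \frac{1}{k+1}$, the partial sums telescope. A short exact check with rational arithmetic (names are illustrative):

```python
from fractions import Fraction


def delta_k(delta: Fraction, k: int) -> Fraction:
    """Confidence parameter for the k-th bound evaluation."""
    return delta / (k * (k + 1))


delta = Fraction(1, 20)
n = 1000
partial_sum = sum(delta_k(delta, k) for k in range(1, n + 1))

# Telescoping: sum_{k=1}^{n} delta / (k (k+1)) = delta * (1 - 1/(n+1)) < delta.
print(partial_sum == delta * (1 - Fraction(1, n + 1)))
```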

Note: It is important to note that $A^2$ does not need to know $\nu$ in advance. Similarly, it does not need to know $D$ in advance.

Correctness 

Theorem 5.1.1 (Correctness) For all $C$, for all $(D, O)$, for all legal subroutines for computing UB and LB, for all $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, with probability $1 - \delta$, $A^2$ returns an $\epsilon$-optimal hypothesis or does not terminate.

Note 2 For most "reasonable" subroutines for computing UB and LB, $A^2$ terminates with probability at least $1 - \delta$. For more discussion and a proof of this fact see Section 5.1.3.

Proof: The first claim is that all bound evaluations are valid simultaneously with probability at least $1 - \delta$, and the second is that the procedure produces an $\epsilon$-optimal hypothesis upon termination.




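Algorithm 7 $A^2$ (allowed error rate $\epsilon$, sampling oracle for $D$, labeling oracle $O$, hypothesis class $C$)

set $i \leftarrow 1$, $D_i \leftarrow D$, $C_i \leftarrow C$, $C_{i-1} \leftarrow C$, $S_{i-1} \leftarrow \emptyset$, and $k \leftarrow 1$.

(1) while $\mathrm{DISAGREE}_D(C_{i-1}) \left( \min_{h \in C_{i-1}} UB(S_{i-1}, h, \delta_k) - \min_{h \in C_{i-1}} LB(S_{i-1}, h, \delta_k) \right) > \epsilon$

    set $S_i \leftarrow \emptyset$, $C'_i \leftarrow C_i$, $k \leftarrow k + 1$

    (2) while $\mathrm{DISAGREE}_D(C'_i) \ge \frac{1}{2} \mathrm{DISAGREE}_D(C_i)$

        if $\mathrm{DISAGREE}_D(C_i) \left( \min_{h \in C_i} UB(S_i, h, \delta_k) - \min_{h \in C_i} LB(S_i, h, \delta_k) \right) \le \epsilon$

            (*) return $h = \arg\min_{h \in C_i} UB(S_i, h, \delta_k)$.

        else $S'_i$ = rejection sample $2|S_i| + 1$ samples $x$ from $D$ satisfying $\exists h_1, h_2 \in C_i : h_1(x) \neq h_2(x)$.

            $S_i \leftarrow S_i \cup \{(x, O(x)) : x \in S'_i\}$, $k \leftarrow k + 1$

            (**) $C'_i = \{h \in C_i : LB(S_i, h, \delta_k) \le \min_{h' \in C_i} UB(S_i, h', \delta_k)\}$, $k \leftarrow k + 1$

        end if

    end while

    $C_{i+1} \leftarrow C'_i$, $D_{i+1} \leftarrow D_i$ restricted to $\{x : \exists h_1, h_2 \in C'_i : h_1(x) \neq h_2(x)\}$

    $i \leftarrow i + 1$

end while

return $h = \arg\min_{h \in C_{i-1}} UB(S_{i-1}, h, \delta_k)$.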
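The control flow of Algorithm 7 can be sketched in runnable form under strong simplifying assumptions that are not in the thesis: the class is a finite grid of thresholds on $[0, 1]$, $D$ is uniform, the oracle is a noiseless threshold `t_star`, and the bound subroutine is a Hoeffding bound with a union bound instead of the VC bound. All names are hypothetical, and this is an illustration of the loop structure, not a definitive implementation.

```python
import math
import random


def a2_thresholds(eps, delta, t_star, rng, grid=100):
    """Sketch of A^2 for h_t(x) = +1 if x >= t else -1, with D uniform on [0, 1)."""
    n_hyp = grid + 1
    state = {"k": 1, "labels": 0}

    def bounds(H, S):
        """(LB, UB) per hypothesis via Hoeffding + union bound; uses one delta_k."""
        k = state["k"]
        state["k"] += 1
        if not S:
            return {t: (0.0, 1.0) for t in H}
        dk = delta / (k * (k + 1))
        w = math.sqrt(math.log(2 * n_hyp / dk) / (2 * len(S)))
        out = {}
        for t in H:
            err = sum((1 if x >= t else -1) != y for x, y in S) / len(S)
            out[t] = (max(0.0, err - w), min(1.0, err + w))
        return out

    def disagree(H):
        return max(H) - min(H)  # mass of [min t, max t) under uniform D

    def gap_and_best(H, S):
        b = bounds(H, S)
        best = min(H, key=lambda t: b[t][1])
        return b[best][1] - min(lb for lb, _ in b.values()), best

    C = [j / grid for j in range(n_hyp)]
    prev_C, prev_S = C, []
    while disagree(prev_C) * gap_and_best(prev_C, prev_S)[0] > eps:  # loop (1)
        S, C_new = [], C
        while disagree(C_new) >= 0.5 * disagree(C):                  # loop (2)
            gap, best = gap_and_best(C, S)
            if disagree(C) * gap <= eps:
                return best, state["labels"]                         # step (*)
            lo, hi = min(C), max(C)
            for _ in range(2 * len(S) + 1):
                x = rng.uniform(lo, hi)  # equals rejection sampling when D is uniform
                S.append((x, 1 if x >= t_star else -1))  # noiseless oracle (assumption)
                state["labels"] += 1
            b = bounds(C, S)
            min_ub = min(ub for _, ub in b.values())
            C_new = [t for t in C if b[t][0] <= min_ub]              # step (**)
        prev_C, prev_S, C = C, S, C_new
    return gap_and_best(prev_C, prev_S)[1], state["labels"]


t_hat, n_labels = a2_thresholds(eps=0.05, delta=0.05, t_star=0.37, rng=random.Random(0))
print(t_hat, n_labels)
```

Note that each round restarts with an empty sample drawn only from the current region of uncertainty, so labels are never spent on points where all surviving thresholds agree.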


To prove the first claim, notice that the samples on which each bound is evaluated are drawn i.i.d. from some distribution over $X \times Y$. This can be verified by noting that the distribution $D_i$ used in round $i$ is precisely that given by drawing $x$ from the underlying distribution $D$ conditioned on the disagreement $\exists h_1, h_2 \in C_i : h_1(x) \neq h_2(x)$, and then labeling according to the oracle $O$.

The $k$-th bound evaluation fails with probability at most $\frac{\delta}{k(k+1)}$. By the union bound, the probability that any bound fails is at most the sum of the probabilities of individual bound failures. This sum is bounded by $\sum_{k=1}^{\infty} \frac{\delta}{k(k+1)} = \delta$.

To prove the second claim, notice first that since every bound evaluation is correct, step (**) never eliminates a hypothesis that has minimum error rate with respect to $(D, O)$. Let us now introduce the following notation. For a hypothesis $h \in C$ and $G \subseteq C$ define:

$e_{D,G,O}(h) = \Pr_{x,y \sim D,O \mid \exists h_1, h_2 \in G : h_1(x) \neq h_2(x)}[h(x) \neq y]$,

$f_{D,G,O}(h) = \Pr_{x,y \sim D,O \mid \forall h_1, h_2 \in G : h_1(x) = h_2(x)}[h(x) \neq y]$.




Notice that $e_{D,G,O}(h)$ is in fact $\mathrm{err}_{D_G,O}(h)$, where $D_G$ is $D$ conditioned on the disagreement $\exists h_1, h_2 \in G : h_1(x) \neq h_2(x)$. Moreover, given any $G \subseteq C$, the error rate of every hypothesis $h$ decomposes into two parts as follows:

$\mathrm{err}_{D,O}(h) = e_{D,G,O}(h) \cdot \mathrm{DISAGREE}_D(G) + f_{D,G,O}(h) \cdot (1 - \mathrm{DISAGREE}_D(G))$

$= \mathrm{err}_{D_G,O}(h) \cdot \mathrm{DISAGREE}_D(G) + f_{D,G,O}(h) \cdot (1 - \mathrm{DISAGREE}_D(G))$.

Notice that the only term that varies with $h \in G$ in the above decomposition is $e_{D,G,O}(h)$. Consequently, finding an $\epsilon$-optimal hypothesis requires only bounding $\mathrm{err}_{D_G,O}(h) \cdot \mathrm{DISAGREE}_D(G)$ to precision $\epsilon$. But this is exactly what the negation of the main while-loop guard does, and this is also the condition used in the first step of the second while loop of the algorithm. In other words, upon termination $A^2$ satisfies

$\mathrm{DISAGREE}_D(C_i) \left( \min_{h \in C_i} UB(S_i, h, \delta_k) - \min_{h \in C_i} LB(S_i, h, \delta_k) \right) \le \epsilon$,

which proves the desired result. ■
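The error decomposition used in the proof above can be checked exactly on a toy example. The four-point distribution, the label probabilities, and the two hypotheses below are illustrative assumptions chosen only to exercise the identity.

```python
from fractions import Fraction as F

px = {0: F(1, 4), 1: F(1, 4), 2: F(1, 4), 3: F(1, 4)}        # D on X = {0, 1, 2, 3}
p_pos = {0: F(0), 1: F(1, 10), 2: F(3, 10), 3: F(9, 10)}      # Pr[O(x) = +1]
G = [lambda x: 1 if x >= 1 else -1, lambda x: 1 if x >= 3 else -1]


def err_on(h, xs):
    """Error rate of h under (D, O), conditioned on x belonging to xs."""
    mass = sum(px[x] for x in xs)
    wrong = sum(px[x] * ((1 - p_pos[x]) if h(x) == 1 else p_pos[x]) for x in xs)
    return wrong / mass


region = [x for x in px if len({g(x) for g in G}) > 1]        # where G disagrees
agree = [x for x in px if x not in region]
disagree_mass = sum(px[x] for x in region)                     # DISAGREE_D(G)

for h in G:
    e, f = err_on(h, region), err_on(h, agree)
    total = err_on(h, list(px))
    print(total == e * disagree_mass + f * (1 - disagree_mass), e, f)
```

In this example the term $f$ is the same for both hypotheses, while $e$ differs, matching the observation that only $e_{D,G,O}(h)$ varies with $h \in G$.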


Fall-back Analysis

This section shows that $A^2$ is never much worse than a standard batch, bound-based algorithm in terms of the number of samples required in order to learn. (A standard example of a bound-based learning algorithm is Empirical Risk Minimization (ERM) [203].)

The sample complexity $m(\epsilon, \delta, C)$ required by a batch algorithm that uses a subroutine for computing $LB(S, h, \delta)$ and $UB(S, h, \delta)$ is defined as the minimum number of samples $m$ such that for all $S \in X^m$, $|UB(S, h, \delta) - LB(S, h, \delta)| \le \epsilon$ for all $h \in C$. For concreteness, this section uses the following bound on $m(\epsilon, \delta, C)$ stated as Theorem A.1.1 in Appendix A.1.1:

$m(\epsilon, \delta, C) = \frac{64}{\epsilon^2} \left( 2 V_C \ln \frac{12}{\epsilon} + \ln \frac{4}{\delta} \right)$.

Here $V_C$ is the VC-dimension of $C$. Assume that $m(2\epsilon, \delta, C) \le \frac{m(\epsilon, \delta, C)}{2}$ and also that the function $m$ is monotonically increasing in $1/\delta$. These conditions are satisfied by many subroutines for computing UB and LB, including those based on the VC bound [202] and the Occam's Razor bound [68].
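A quick numeric sketch of the bound above, checking the two assumptions placed on $m$ for one illustrative choice of parameters. The function name and the rounding up to an integer are assumptions made for this example.

```python
import math


def vc_sample_complexity(eps: float, delta: float, vc_dim: int) -> int:
    """m(eps, delta, C) = (64 / eps^2) * (2 V_C ln(12 / eps) + ln(4 / delta)), rounded up."""
    value = 64 / eps ** 2 * (2 * vc_dim * math.log(12 / eps) + math.log(4 / delta))
    return math.ceil(value)


m_eps = vc_sample_complexity(0.1, 0.05, vc_dim=3)
m_2eps = vc_sample_complexity(0.2, 0.05, vc_dim=3)
print(m_eps, m_2eps, m_2eps <= m_eps / 2)
```

Doubling $\epsilon$ divides the leading factor $64/\epsilon^2$ by four while the logarithmic terms only shrink, which is why the halving assumption holds for this bound.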

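Theorem 5.1.2 For all $C$, for all $(D, O)$, for all UB and LB satisfying the assumption above, for all $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, the algorithm $A^2$ makes at most $2m(\epsilon, \delta', C)$ calls to the oracle $O$, where $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$, $N(\epsilon, \delta, C)$ satisfies $N(\epsilon, \delta, C) \ge \ln \frac{1}{\epsilon} \ln m\left(\epsilon, \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}, C\right)$, and $m(\epsilon, \delta, C)$ is the sample complexity of UB and LB.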

Proof: Let $\delta_k = \frac{\delta}{k(k+1)}$ be the confidence parameter used in the $k$-th application of the subroutine for computing UB and LB. The proof works by finding an upper bound $N(\epsilon, \delta, C)$ on the number of bound evaluations throughout the life of the algorithm. This implies that the confidence parameter $\delta_k$ is always at least $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$.

Recall that $D_i$ is the distribution over $x$ used on the $i$-th iteration of the first while loop. Consider $i = 1$. If condition 2 of Algorithm $A^2$ is repeatedly satisfied, then after labeling $m(\epsilon, \delta', C)$ examples from $D_1$, for all hypotheses $h \in C_1$,

$|UB(S_1, h, \delta') - LB(S_1, h, \delta')| \le \epsilon$

simultaneously. Note that in these conditions $A^2$ safely halts. Notice also that the number of bound evaluations during this process is at most $\log_2 m(\epsilon, \delta', C)$.




On the other hand, if loop (2) ever completes and $i$ increases, then, to finish when $i = 2$, it is enough to have uniformly for all $h \in C_2$,

$|UB(S_2, h, \delta') - LB(S_2, h, \delta')| \le 2\epsilon$.

(This follows from the exit conditions in the outer while-loop and the 'if' in Step 2 of $A^2$.) Uniformly bounding the gap between upper and lower bounds over all hypotheses $h \in C_2$ to within $2\epsilon$ requires $m(2\epsilon, \delta', C) \le \frac{m(\epsilon, \delta', C)}{2}$ labeled examples from $D_2$, and the number of bound evaluations in round $i = 2$ is at most $\log_2 m(\epsilon, \delta', C)$.

In general, in round $i$ it is enough to have uniformly for all $h \in C_i$,

$|UB(S_i, h, \delta') - LB(S_i, h, \delta')| \le 2^{i-1}\epsilon$,

which requires $m(2^{i-1}\epsilon, \delta', C) \le \frac{m(\epsilon, \delta', C)}{2^{i-1}}$ labeled examples from $D_i$. Also the number of bound evaluations in round $i$ is at most $\log_2 m(\epsilon, \delta', C)$.

Since the number of rounds is bounded by $\log_2 \frac{1}{\epsilon}$, it follows that the maximum number of bound evaluations throughout the life of the algorithm is at most $\log_2 \frac{1}{\epsilon} \log_2 m(\epsilon, \delta', C)$. This implies that in order to determine an upper bound $N(\epsilon, \delta, C)$ only a solution to the inequality

$N(\epsilon, \delta, C) \ge \log_2 \frac{1}{\epsilon} \log_2 m\left(\epsilon, \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}, C\right)$

is required.

Finally, adding up the number of calls to the label oracle $O$ in all rounds yields at most $2m(\epsilon, \delta', C)$ over the life of the algorithm. ■
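The inequality defining $N(\epsilon, \delta, C)$ in the proof above is implicit, because $N$ appears on both sides. Since the right-hand side grows only doubly logarithmically in $N$, a solution can be found numerically by a simple search. The sketch below assumes the VC-based form of $m$ from Theorem A.1.1; the function names are illustrative.

```python
import math


def vc_sample_complexity(eps: float, delta: float, vc_dim: int) -> float:
    """m(eps, delta, C) = (64 / eps^2) * (2 V_C ln(12 / eps) + ln(4 / delta))."""
    return 64 / eps ** 2 * (2 * vc_dim * math.log(12 / eps) + math.log(4 / delta))


def rhs(n: int, eps: float, delta: float, vc_dim: int) -> float:
    """Right-hand side: log2(1/eps) * log2 m(eps, delta / (n (n + 1)), C)."""
    m = vc_sample_complexity(eps, delta / (n * (n + 1)), vc_dim)
    return math.log2(1 / eps) * math.log2(m)


def smallest_n(eps: float, delta: float, vc_dim: int) -> int:
    """Smallest positive integer N with N >= rhs(N), found by linear search."""
    n = 1
    while n < rhs(n, eps, delta, vc_dim):
        n += 1
    return n


N = smallest_n(eps=0.1, delta=0.05, vc_dim=1)
print(N, 0.05 / (N * (N + 1)))  # N and the resulting delta'
```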

Let $V_C$ denote the VC-dimension of $C$, and let $m(\epsilon, \delta, C)$ be the number of examples required by the ERM algorithm. As stated in Theorem A.1.1 in Appendix A.1.1, a classic bound on $m(\epsilon, \delta, C)$ is $m(\epsilon, \delta, C) = \frac{64}{\epsilon^2} \left( 2 V_C \ln \frac{12}{\epsilon} + \ln \frac{4}{\delta} \right)$. Using Theorem 5.1.2, the following corollary holds.

Corollary 5.1.3 For all hypothesis classes $C$ of VC-dimension $V_C$, for all distributions $(D, O)$ over $X \times Y$, for all $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, the algorithm $A^2$ requires at most $O\left( \frac{1}{\epsilon^2} \left( V_C \ln \frac{1}{\epsilon} + \ln \frac{1}{\delta} \right) \right)$ labeled examples from the oracle $O$.

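Proof: The form of $m(\epsilon, \delta, C)$ and Theorem 5.1.2 imply an upper bound on $N = N(\epsilon, \delta, C)$. It is enough to find the smallest $N$ satisfying

$N \ge \ln \frac{1}{\epsilon} \ln \left( \frac{64}{\epsilon^2} \left( 2 V_C \ln \frac{12}{\epsilon} + \ln \frac{4N(N+1)}{\delta} \right) \right)$.

Using the inequality $\ln a \le ab - \ln b - 1$ for all $a, b > 0$ and some simple algebraic manipulations, the desired upper bound on $N(\epsilon, \delta, C)$ holds. The result then follows from Theorem 5.1.2. ■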
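For completeness, the inequality $\ln a \le ab - \ln b - 1$ invoked in the proof above follows from the standard bound $\ln z \le z - 1$; a short derivation:

```latex
\begin{align*}
\ln z &\le z - 1 && \text{for all } z > 0, \\
\ln(ab) &\le ab - 1 && \text{taking } z = ab \text{ with } a, b > 0, \\
\ln a &\le ab - \ln b - 1 && \text{since } \ln(ab) = \ln a + \ln b.
\end{align*}
```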


5.1.4  Active  Learning  Speedups 

This section gives examples of exponential sample complexity improvements achieved by $A^2$.




Learning  Threshold  Functions 


Linear threshold functions are the simplest and easiest to analyze class. It turns out that even for this class, exponential reductions in sample complexity are not achievable when the noise rate $\nu$ is large [143]. We prove the following three results:

1. An exponential improvement in sample complexity when the noise rate is small (Theorem 5.1.4).

2. A slower improvement when the noise rate is large (Theorem 5.1.5).

3. An exponential improvement when the noise rate is large but due to constant label noise (Theorem 5.1.6). This shows that for some forms of high noise exponential improvement remains possible.

All results in this subsection assume that subroutines LB and UB in $A^2$ are based on the VC bound.

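Theorem 5.1.4 Let $C$ be the set of thresholds on an interval. For all distributions $(D, O)$ where $D$ is a continuous probability distribution function, for any $\epsilon < \frac{1}{2}$ and $\frac{\epsilon}{16} \ge \nu$, the algorithm $A^2$ makes

$O\left( \ln \frac{1}{\epsilon} \ln \left( \frac{\ln \frac{1}{\epsilon\delta}}{\delta} \right) \right)$

calls to the oracle $O$ on examples drawn i.i.d. from $D$, with probability $1 - \delta$.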

Proof: Consider round $i \ge 1$ of the algorithm. For $h_1, h_2 \in C_i$, let $d_i(h_1, h_2)$ be the probability that $h_1$ and $h_2$ predict differently on a random example drawn according to the distribution $D_i$, i.e., $d_i(h_1, h_2) = \Pr_{x \sim D_i}[h_1(x) \neq h_2(x)]$.

Let $h^*$ be any minimum error rate hypothesis in $C$. Note that for any hypothesis $h \in C_i$, we have $\mathrm{err}_{D_i,O}(h) \ge d_i(h, h^*) - \mathrm{err}_{D_i,O}(h^*)$ and $\mathrm{err}_{D_i,O}(h^*) \le \nu / Z_i$, where $Z_i = \Pr_{x \sim D}[x \in [lower_i, upper_i]]$ is a shorthand for $\mathrm{DISAGREE}_D(C_i)$ and $[lower_i, upper_i]$ denotes the support of $D_i$. Thus $\mathrm{err}_{D_i,O}(h) \ge d_i(h, h^*) - \nu / Z_i$.

We will show that at least a $\frac{1}{2}$-fraction (measured with respect to $D_i$) of thresholds in $C_i$ satisfy $d_i(h, h^*) \ge \frac{1}{4}$, and these thresholds are located at the ends of the interval $[lower_i, upper_i]$. Assume first that both $d_i(h^*, lower_i) \ge \frac{1}{4}$ and $d_i(h^*, upper_i) \ge \frac{1}{4}$; then let $l_i$ and $u_i$ be the hypotheses to the left and to the right of $h^*$, respectively, that satisfy $d_i(h^*, l_i) = \frac{1}{4}$ and $d_i(h^*, u_i) = \frac{1}{4}$. All $h \in [lower_i, l_i] \cup [u_i, upper_i]$ satisfy $d_i(h^*, h) \ge \frac{1}{4}$ and moreover

$\Pr_{x \sim D_i}[x \in [lower_i, l_i] \cup [u_i, upper_i]] \ge \frac{1}{2}$.

Now suppose that $d_i(h^*, lower_i) \le \frac{1}{4}$. Let $u_i$ be the hypothesis to the right of $h^*$ with $d_i(u_i, upper_i) = \frac{1}{2}$. Then all $h \in [u_i, upper_i]$ satisfy $d_i(h^*, h) \ge \frac{1}{4}$ and moreover $\Pr_{x \sim D_i}[x \in [u_i, upper_i]] \ge \frac{1}{2}$. A similar argument holds for $d_i(h^*, upper_i) \le \frac{1}{4}$.


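Using the VC bound, with probability $1 - \delta'$, if $|S_i| = O\left( \frac{\ln \frac{1}{\delta'}}{\left( \frac{1}{8} - \frac{\nu}{Z_i} \right)^2} \right)$, then for all hypotheses $h \in C_i$ simultaneously, $|UB(S_i, h, \delta) - LB(S_i, h, \delta)| \le \frac{1}{8} - \frac{\nu}{Z_i}$ holds. Note that $\nu / Z_i$ is always upper bounded by $\frac{1}{16}$.

Consider a hypothesis $h \in C_i$ with $d_i(h, h^*) \ge \frac{1}{4}$. For any such $h$,

$\mathrm{err}_{D_i,O}(h) \ge d_i(h, h^*) - \frac{\nu}{Z_i} \ge \frac{1}{4} - \frac{\nu}{Z_i}$,

and so

$LB(S_i, h, \delta) \ge \frac{1}{4} - \frac{\nu}{Z_i} - \left( \frac{1}{8} - \frac{\nu}{Z_i} \right) = \frac{1}{8}$.

On the other hand, $\mathrm{err}_{D_i,O}(h^*) \le \frac{\nu}{Z_i}$ and so

$UB(S_i, h^*, \delta) \le \frac{\nu}{Z_i} + \frac{1}{8} - \frac{\nu}{Z_i} = \frac{1}{8}$.

Thus $A^2$ eliminates all $h \in C_i$ with $d_i(h, h^*) \ge \frac{1}{4}$. But that means $\mathrm{DISAGREE}_D(C'_i) \le \frac{1}{2} \mathrm{DISAGREE}_D(C_i)$, thus terminating round $i$.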

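Each exit from while loop (2) decreases $\mathrm{DISAGREE}_D(C_i)$ by at least a factor of 2, implying that the number of executions is bounded by $\log \frac{1}{\epsilon}$. The algorithm makes $O\left( \ln \frac{1}{\delta'} \ln \frac{1}{\epsilon} \right)$ calls to the oracle, where $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$ and $N(\epsilon, \delta, C)$ is an upper bound on the number of bound evaluations throughout the life of the algorithm.

The number of bound evaluations required in round $i$ is $O\left( \ln \frac{1}{\delta'} \right)$, which implies that $N(\epsilon, \delta, C)$ should satisfy

$c \ln \left( \frac{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}{\delta} \right) \ln \frac{1}{\epsilon} \le N(\epsilon, \delta, C)$,

for some constant $c$. Solving this inequality completes the proof. ■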


Theorem 5.1.5 below asymptotically matches a lower bound of Kääriäinen [143]. Recall that $A^2$ does not need to know $\nu$ in advance.

Theorem 5.1.5 Let $C$ be the set of thresholds on an interval. Suppose that $\epsilon < \frac{1}{2}$ and $\nu > 16\epsilon$. For all $D$, with probability $1 - \delta$, the algorithm $A^2$ requires at most $\tilde{O}\left( \frac{\nu^2 \ln \frac{1}{\delta}}{\epsilon^2} \right)$ labeled samples.

Proof: The proof is similar to the previous proof. Theorem 5.1.4 implies that loop (2) completes $\Theta(\log \frac{1}{\nu})$ times. At this point, the minimum error rate of the remaining hypotheses conditioned on disagreement becomes sufficient so that the algorithm may only halt via the return step (*). In this case, $\mathrm{DISAGREE}_D(C) = \Theta(\nu)$, implying that the number of samples required is $\tilde{O}\left( \frac{\nu^2 \ln \frac{1}{\delta}}{\epsilon^2} \right)$. ■


The final theorem is for the constant noise case where $\left| \Pr[y = 1 \mid x] - \frac{1}{2} \right| = \frac{1}{2} - \nu$ for all $x \in X$. The theorem is similar to earlier work [75], except that we achieve these improvements with a general purpose active learning algorithm that does not use any prior over the hypothesis space or knowledge of the noise rate, and is applicable to arbitrary hypothesis spaces.


Theorem 5.1.6 Let $C$ be the set of thresholds on an interval. For all unlabeled data distributions $D$, for all labeled data distributions $O$, for any constant label noise $\nu < 1/2$ and $\epsilon < \frac{1}{2}$, the algorithm $A^2$ makes

$O\left( \frac{1}{(1 - 2\nu)^2} \ln \frac{1}{\epsilon} \ln \left( \frac{\ln \frac{1}{\epsilon\delta}}{\delta} \right) \right)$

calls to the oracle $O$ on examples drawn i.i.d. from $D$, with probability $1 - \delta$.

The proof is essentially the same as for Theorem 5.1.4, except that the constant label noise condition implies that the amount of noise in the remaining actively labeled subset stays bounded through the recursions.

Proof: Consider round $i \ge 1$. For $h_1, h_2 \in C_i$, let $d_i(h_1, h_2) = \Pr_{x \sim D_i}[h_1(x) \neq h_2(x)]$. Note that for any hypothesis $h \in C_i$, we have $\mathrm{err}_{D_i,O}(h) = d_i(h, h^*)(1 - 2\nu) + \nu$ and $\mathrm{err}_{D_i,O}(h^*) = \nu$, where $h^*$ is a minimum error rate threshold.

As in the proof of Theorem 5.1.4, at least a $\frac{1}{2}$-fraction (measured with respect to $D_i$) of thresholds in $C_i$ satisfy $d_i(h, h^*) \ge \frac{1}{4}$, and these thresholds are located at the ends of the support $[lower_i, upper_i]$ of


'’The  assumption  in  the  theorem  statement  can  be  weakened  to  u  < 


(5Ti)73  for  any  constant  A  >0. 




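$D_i$. The VC bound implies that for any $\delta' > 0$ with probability $1 - \delta'$, if $|S_i| = O\left( \frac{\ln \frac{1}{\delta'}}{(1 - 2\nu)^2} \right)$, then for all hypotheses $h \in C_i$ simultaneously, $|UB(S_i, h, \delta) - LB(S_i, h, \delta)| < \frac{1 - 2\nu}{8}$.

Consider a hypothesis $h \in C_i$ with $d_i(h, h^*) \ge \frac{1}{4}$. For any such $h$, $\mathrm{err}_{D_i,O}(h) \ge \frac{1 - 2\nu}{4} + \nu = \frac{1}{4} + \frac{\nu}{2}$, and so $LB(S_i, h, \delta) > \frac{1}{4} + \frac{\nu}{2} - \frac{1}{8}(1 - 2\nu) = \frac{1}{8} + \frac{3\nu}{4}$. On the other hand, $\mathrm{err}_{D_i,O}(h^*) = \nu$, and so $UB(S_i, h^*, \delta) < \nu + \left( \frac{1}{8} - \frac{\nu}{4} \right) = \frac{1}{8} + \frac{3\nu}{4}$. Thus $A^2$ eliminates all $h \in C_i$ with $d_i(h, h^*) \ge \frac{1}{4}$. But this means that $\mathrm{DISAGREE}_D(C'_i) \le \frac{1}{2} \mathrm{DISAGREE}_D(C_i)$, thus terminating round $i$.

Finally notice that $A^2$ makes $O\left( \frac{1}{(1 - 2\nu)^2} \ln \frac{1}{\delta'} \ln \frac{1}{\epsilon} \right)$ calls to the oracle, where $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$ and $N(\epsilon, \delta, C)$ is an upper bound on the number of bound evaluations throughout the life of the algorithm. The number of bound evaluations required in round $i$ is $O(\ln(1/\delta'))$, which implies that the number of bound evaluations throughout the life of the algorithm $N(\epsilon, \delta, C)$ should satisfy

$c \ln \left( \frac{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}{\delta} \right) \ln \frac{1}{\epsilon} \le N(\epsilon, \delta, C)$,

for some constant $c$. Solving this inequality completes the proof. ■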


Linear  Separators  under  the  Uniform  Distribution 


A commonly analyzed case for which active learning is known to give exponential savings in the number of labeled examples is when the data is drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the labels are consistent with a linear separator going through the origin. Note that even in this seemingly simple scenario, there exists an $\Omega\left( \frac{1}{\epsilon} \left( d + \log \frac{1}{\delta} \right) \right)$ lower bound on the PAC passive supervised learning sample complexity [165]. We will show that $A^2$ provides exponential savings in this case even in the presence of arbitrary forms of noise.

Let $X = \{x \in \mathbb{R}^d : \|x\| = 1\}$, the unit sphere in $\mathbb{R}^d$. Assume that $D$ is uniform over $X$, and let $C$ be the class of linear separators through the origin. Any $h \in C$ is a homogeneous hyperplane represented by a unit vector $w \in X$ with the classification rule $h(x) = \mathrm{sign}(w \cdot x)$. The distance between two hypotheses $u$ and $v$ in $C$ with respect to a distribution $D$ (i.e., the probability that they predict differently on a random example drawn from $D$) is given by $d_D(u, v) = \frac{\arccos(u \cdot v)}{\pi}$. Finally, let $\theta(u, v) = \arccos(u \cdot v)$. Thus $d_D(u, v) = \frac{\theta(u, v)}{\pi}$.

In this section we will use a classic lemma about the uniform distribution. For a proof see, for example, [33, 98].
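The identity $d_D(u, v) = \theta(u, v) / \pi$ can be checked empirically by Monte Carlo sampling. The sketch below uses only the standard library; the function name and the particular vectors are illustrative assumptions.

```python
import math
import random
from typing import Sequence


def disagreement_mc(u: Sequence[float], v: Sequence[float], n: int, rng: random.Random) -> float:
    """Estimate Pr_x[sign(u.x) != sign(v.x)] for x uniform on the unit sphere."""
    d = len(u)
    differ = 0
    for _ in range(n):
        # A standard Gaussian vector has a uniformly random direction; the sign of
        # an inner product does not depend on the norm, so x is left unnormalized.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        side_u = sum(a * b for a, b in zip(u, x)) >= 0
        side_v = sum(a * b for a, b in zip(v, x)) >= 0
        differ += side_u != side_v
    return differ / n


u, v = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
theta = math.acos(sum(a * b for a, b in zip(u, v)))
estimate = disagreement_mc(u, v, 20000, random.Random(0))
print(estimate, theta / math.pi)
```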

Lemma 5.1.7 For any fixed unit vector $w$ and any $0 < \gamma \le 1$,

$\frac{\gamma}{4} \le \Pr_x \left[ |w \cdot x| \le \frac{\gamma}{\sqrt{d}} \right] \le \gamma$,

where $x$ is drawn uniformly from the unit sphere.
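A Monte Carlo sanity check of Lemma 5.1.7 for one illustrative choice of $d$ and $\gamma$ is sketched below. By rotational symmetry the fixed unit vector $w$ is taken to be the first coordinate axis; the function name and parameters are assumptions made for this example.

```python
import math
import random


def band_probability(d: int, gamma: float, n: int, rng: random.Random) -> float:
    """Estimate Pr[|w.x| <= gamma / sqrt(d)] for x uniform on the unit sphere, w = e_1."""
    hits = 0
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in g))   # normalize to land on the sphere
        if abs(g[0] / norm) <= gamma / math.sqrt(d):
            hits += 1
    return hits / n


d, gamma = 10, 0.5
p = band_probability(d, gamma, 20000, random.Random(1))
print(gamma / 4, p, gamma)
```

The lemma says that a band of half-width $\gamma / \sqrt{d}$ around the equator carries probability mass proportional to $\gamma$, which is the fact that drives the halving argument for linear separators.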

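Theorem 5.1.8 Let $X$, $C$, and $D$ be as defined above, and let LB and UB be the VC bound. Then for any $0 < \epsilon < \frac{1}{2}$, $0 < \nu < \frac{\epsilon}{16\sqrt{d}}$, and $\delta > 0$, with probability $1 - \delta$, $A^2$ requires

$O\left( d \left( d \ln d + \ln \frac{1}{\delta'} \right) \ln \frac{1}{\epsilon} \right)$

calls to the labeling oracle, where $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$ and

$N(\epsilon, \delta, C) = O\left( \ln \frac{1}{\epsilon} \left( d^2 \ln d + d \ln \frac{d \ln \frac{1}{\epsilon}}{\delta} \right) \right)$.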



Proof: Let $w^* \in C$ be a hypothesis with the minimum error rate $\nu$. Denote the region of uncertainty in round $i$ by $R_i$. Thus $\Pr_{x \sim D}[x \in R_i] = \mathrm{DISAGREE}_D(C_i)$. Consider round $i$ of $A^2$. We prove that the round completes with high probability if a certain threshold on the number of labeled examples is reached. The round may complete with a smaller number of examples, but this is fine because the metric of progress $\mathrm{DISAGREE}_D(C_i)$ must halve in order to complete.

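Theorem A.1.1 says that it suffices to query the oracle on a set $S$ of $O(d^2 \ln d + d \ln \frac{1}{\delta'})$ examples from the $i$-th distribution $D_i$ to guarantee, with probability $1 - \delta'$, that for all $w \in C_i$,

$|\mathrm{err}_{D_i,O}(w) - \widehat{\mathrm{err}}_{D_i,O}(w)| \le \frac{1}{2} \left( \frac{1}{8\sqrt{d}} - \frac{\nu}{r_i} \right)$,

where $r_i$ is a shorthand for $\mathrm{DISAGREE}_D(C_i)$. (By assumption, $\nu < \frac{\epsilon}{16\sqrt{d}}$ and the loop guard guarantees that $\mathrm{DISAGREE}_D(C_i) \ge \epsilon$. Thus the precision above is at least $\frac{1}{32\sqrt{d}}$.) This implies that $UB(S, w, \delta') - \mathrm{err}_{D_i,O}(w) \le \frac{1}{8\sqrt{d}} - \frac{\nu}{r_i}$ and $\mathrm{err}_{D_i,O}(w) - LB(S, w, \delta') \le \frac{1}{8\sqrt{d}} - \frac{\nu}{r_i}$. Consider any $w \in C_i$ with $d_{D_i}(w, w^*) \ge \frac{1}{4\sqrt{d}}$. For any such $w$, $\mathrm{err}_{D_i,O}(w) \ge \frac{1}{4\sqrt{d}} - \frac{\nu}{r_i}$, and so

$LB(S, w, \delta') \ge \frac{1}{4\sqrt{d}} - \frac{\nu}{r_i} - \frac{1}{8\sqrt{d}} + \frac{\nu}{r_i} = \frac{1}{8\sqrt{d}}$.

However, $\mathrm{err}_{D_i,O}(w^*) \le \frac{\nu}{r_i}$, and thus $UB(S, w^*, \delta') \le \frac{\nu}{r_i} + \frac{1}{8\sqrt{d}} - \frac{\nu}{r_i} = \frac{1}{8\sqrt{d}}$, so $A^2$ eliminates $w$ in step (**).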

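Thus round $i$ eliminates all hypotheses $w \in C_i$ with $d_{D_i}(w, w^*) \ge \frac{1}{4\sqrt{d}}$. Since all hypotheses in $C_i$ agree on every $x \notin R_i$,

$d_{D_i}(w, w^*) = \frac{1}{r_i} d_D(w, w^*) = \frac{\theta(w, w^*)}{\pi r_i}$.

Thus round $i$ eliminates all hypotheses $w \in C_i$ with $\theta(w, w^*) \ge \frac{\pi r_i}{4\sqrt{d}}$. But since $2\theta / \pi \le \sin\theta$ for $\theta \in (0, \frac{\pi}{2}]$, it certainly eliminates all $w$ with $\sin\theta(w, w^*) \ge \frac{r_i}{2\sqrt{d}}$.

Consider any $x \in R_{i+1}$ and the value $|w^* \cdot x| = \cos\theta(w^*, x)$. There must exist a hypothesis $w \in C_{i+1}$ that disagrees with $w^*$ on $x$; otherwise $x$ would not be in $R_{i+1}$. But then $\cos\theta(w^*, x) \le \cos(\frac{\pi}{2} - \theta(w, w^*)) = \sin\theta(w, w^*) < \frac{r_i}{2\sqrt{d}}$, where the last inequality is due to the fact that $A^2$ eliminates all $w$ with $\sin\theta(w, w^*) \ge \frac{r_i}{2\sqrt{d}}$. Thus any $x \in R_{i+1}$ must satisfy $|w^* \cdot x| < \frac{r_i}{2\sqrt{d}}$. Using the fact that $\Pr[A \mid B] = \frac{\Pr[AB]}{\Pr[B]} \le \frac{\Pr[A]}{\Pr[B]}$,

$\Pr_{x \sim D_i}[x \in R_{i+1}] \le \Pr_{x \sim D_i}\left[ |w^* \cdot x| \le \frac{r_i}{2\sqrt{d}} \right] \le \frac{\Pr_{x \sim D}\left[ |w^* \cdot x| \le \frac{r_i}{2\sqrt{d}} \right]}{\Pr_{x \sim D}[x \in R_i]} \le \frac{r_i / 2}{r_i} = \frac{1}{2}$,

where the third inequality follows from Lemma 5.1.7. Thus $\mathrm{DISAGREE}_D(C_{i+1}) \le \frac{1}{2} \mathrm{DISAGREE}_D(C_i)$, as desired.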

In order to finish the argument, it suffices to notice that since every round cuts $\mathrm{DISAGREE}_D(C_i)$ at least in half, the total number of rounds is upper bounded by $\log \frac{1}{\epsilon}$. Notice also that the $A^2$ algorithm makes $O\left( \left( d^2 \ln d + d \ln \frac{1}{\delta'} \right) \ln \frac{1}{\epsilon} \right)$ calls to the oracle, where $\delta' = \frac{\delta}{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}$ and $N(\epsilon, \delta, C)$ is an upper bound on the number of bound evaluations throughout the life of the algorithm. The number

^  The  assumption  in  the  theorem  statement  can  be  weakened  iov<  for  any  constant  A  >  0. 




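of bound evaluations required in round $i$ is $O\left( d^2 \ln d + d \ln \frac{1}{\delta'} \right)$. This implies that the number of bound evaluations throughout the life of the algorithm $N(\epsilon, \delta, C)$ should satisfy

$c \left( d^2 \ln d + d \ln \left( \frac{N(\epsilon,\delta,C)(N(\epsilon,\delta,C)+1)}{\delta} \right) \right) \ln \frac{1}{\epsilon} \le N(\epsilon, \delta, C)$,

for some constant $c$. Solving this inequality completes the proof. ■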


Note: For comparison, the query complexity of the Perceptron-based active learning algorithm of [98] is $O\left( d \ln \frac{1}{\epsilon\delta} \left( \ln \frac{d}{\delta} + \ln \ln \frac{1}{\epsilon} \right) \right)$, for the same $C$, $X$, and $D$, but only for the realizable case when $\nu = 0$. Similar bounds are obtained in [33] both in the realizable case and for a specific form of noise related to the Tsybakov small noise condition. (We present these results in Section 5.2.) The cleanest and simplest argument that exponential improvement is in principle possible in the realizable case for the same $C$, $X$, and $D$ appears in [94]. Our work provides the first justification of why one can hope to achieve similarly strong guarantees in the much harder agnostic case, when the noise rate is sufficiently small with respect to the desired error.


5.1.5  Subsequent  Work 

Following the initial publication of $A^2$, Hanneke has further analyzed the $A^2$ algorithm [130], deriving a general upper bound on the number of label requests made by $A^2$. This bound is expressed in terms of a particular quantity called the disagreement coefficient, which roughly quantifies how quickly the region of disagreement can grow as a function of the radius of the version space. For concreteness this bound is included below.

In addition, Dasgupta, Hsu, and Monteleoni [99] introduce and analyze a new agnostic active learning algorithm. While similar to $A^2$, this algorithm simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, keeping track of the version space implicitly via label constraints.

Subsequent Guarantees for $A^2$

This section describes the disagreement coefficient [130] and the guarantees it provides for the $A^2$ algorithm. We begin with a few additional definitions, in the notation of Section 5.1.2.

Definition 5.1.2 The disagreement rate $\Delta(V)$ of a set $V \subseteq C$ is defined as

$\Delta(V) = \Pr_{x \sim D}[x \in \mathrm{DISAGREE}_D(V)]$.

Definition 5.1.3 For $h \in C$, $r > 0$, let $B(h, r) = \{h' \in C : d(h', h) \le r\}$ and define the disagreement rate at radius $r$ as

$\Delta_r = \sup_{h \in C} \Delta(B(h, r))$.

The disagreement coefficient is the infimum value of $\theta > 0$ such that $\forall r > \nu + \epsilon$,

$\Delta_r \le \theta r$.
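As a worked instance of Definition 5.1.3, the sketch below evaluates $\Delta_r$ and the ratio $\Delta_r / r$ numerically for thresholds on $[0, 1]$ under the uniform distribution, where the ball $B(h_t, r)$ contains the thresholds within distance $r$ of $t$ and its disagreement region is the interval between the extreme thresholds of the ball. The grid discretization and all names are illustrative assumptions.

```python
def delta_r(r: float, grid: int = 1000) -> float:
    """Disagreement rate at radius r for thresholds on [0, 1] under uniform D.

    For the threshold at t, B(h_t, r) disagrees exactly on
    [max(0, t - r), min(1, t + r)); the supremum over t is taken over a grid.
    """
    return max(min(1.0, t + r) - max(0.0, t - r)
               for t in (j / grid for j in range(grid + 1)))


def disagreement_coefficient(r_min: float, steps: int = 200) -> float:
    """Largest ratio Delta_r / r over a grid of radii r in (r_min, 1]."""
    radii = [r_min + (1.0 - r_min) * j / steps for j in range(1, steps + 1)]
    return max(delta_r(r) / r for r in radii)


# r_min plays the role of nu + eps in Definition 5.1.3.
print(disagreement_coefficient(r_min=0.05))
```

For small radii the ball around a central threshold disagrees on an interval of length $2r$, so the ratio equals 2, in line with the value reported for thresholds in [130].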


We now present the main result of [130].

Note also that it is not clear if the analysis in [98] is extendable to commonly used types of noise, e.g., Tsybakov noise.




Theorem 5.1.9 If $\theta$ is the disagreement coefficient for $C$, then with probability at least $1 - \delta$, given the inputs $\epsilon$ and $\delta$, $A^2$ outputs an $\epsilon$-optimal hypothesis $h$. Moreover, the number of label requests made by $A^2$ is at most:

$O\left( \theta^2 \left( \frac{\nu^2}{\epsilon^2} + 1 \right) \left( V_C \ln \frac{1}{\epsilon} + \ln \frac{1}{\delta} \right) \ln \frac{1}{\epsilon} \right)$,

where $V_C \ge 1$ is the VC-dimension of $C$.

As shown in [130], for the concept space $C$ of thresholds on an interval the disagreement coefficient is $\theta = 2$. Also, if $X = \{x \in \mathbb{R}^d : \|x\| = 1\}$ is the unit sphere in $\mathbb{R}^d$, $D$ is uniform over $X$, and $C$ is the class of linear separators through the origin, then the disagreement coefficient $\theta$ satisfies

$\frac{1}{4} \min\left\{ \pi\sqrt{d}, \frac{1}{\nu + \epsilon} \right\} \le \theta \le \min\left\{ \pi\sqrt{d}, \frac{1}{\nu + \epsilon} \right\}$.

These clearly match the results in Section 5.1.4.


5.1.6  Conclusions 

We present here $A^2$, the first active learning algorithm that finds an $\epsilon$-optimal hypothesis in any hypothesis
class, when the distribution has arbitrary forms of noise. The algorithm relies only upon the assumption
that the samples are drawn i.i.d. from a fixed (unknown) distribution, and it does not need to know the
error rate of the best classifier in the class in advance. We analyze $A^2$ for several settings considered
before in the realizable case, showing that $A^2$ achieves an exponential improvement over the usual sample
complexity of supervised learning in these settings. We also provide a guarantee that $A^2$ never requires
substantially more labeled examples than passive learning.

A more general open question is what conditions are sufficient and necessary for active learning to
succeed in the agnostic case. What is the right quantity that can characterize the sample complexity of
agnostic active learning? As mentioned already, some progress in this direction has been recently made
in [130] and [99]; however, those results characterize non-aggressive agnostic active learning. Deriving
and analyzing the optimal agnostic active learning strategy is still an open question.

Much  of  the  existing  literature  on  active  learning  has  been  focused  on  binary  classification;  it  would 
be  interesting  to  analyze  active  learning  for  other  loss  functions.  The  key  ingredient  allowing  recursion  in 
the  proof  of  correctness  is  a  loss  that  is  unvarying  with  respect  to  substantial  variation  over  the  hypothesis 
space.  Many  losses  such  as  squared  error  loss  do  not  have  this  property,  so  achieving  substantial  speedups, 
if  that  is  possible,  requires  new  insights.  For  other  losses  with  this  property  (such  as  hinge  loss  or  clipped 
squared loss), generalizations of $A^2$ appear straightforward.


5.2  Margin  Based  Active  Learning 

A common feature of the selective sampling algorithm [86], $A^2$, and others [99] is that they are all
non-aggressive in their choice of query points. Even points on which there is a small amount of uncertainty
are queried, rather than pursuing the maximally uncertain point. We show here that more aggressive
strategies can generally lead to better bounds. Specifically, we analyze a margin based active learning
algorithm for learning linear separators and instantiate it for a few important cases, some of which have
been previously considered in the literature. The generic procedure we analyze is Algorithm 8. The key
contributions of this section are the following:




1. We point out that in order to get a labeled data sample complexity which has a logarithmic
dependence on $1/\epsilon$ without increasing the dependence on $d$ (i.e., a truly exponential improvement in the
labeled data sample complexity over passive learning) we have to use a strategy which is more
aggressive than a version space strategy (the one proposed by Cohn, Atlas and Ladner in [86] and
later analyzed in [30], which we discussed in Section 5.1). We point out that this is true even
in the special case when the data instances are drawn uniformly from the unit ball in $\mathbb{R}^d$, and
when the labels are consistent with a linear separator going through the origin. Indeed, in order
to obtain a truly exponential improvement, and to be able to learn with only $\tilde{O}(d \log(1/\epsilon))$ labeled
examples, we need, in each iteration, to sample our examples from a carefully chosen subregion,
and not from the entire region of uncertainty, which would imply a labeled data sample complexity
of $\tilde{O}(d^{3/2} \log(1/\epsilon))$. The fact that a truly exponential improvement is possible in this special setting
(through computationally efficient procedures) was proven before both in [98] and [112], but via
more complicated and more specific arguments (which additionally are not easily generalizable
to deal with various types of noise).

2. We show that our algorithm and argument extend to the non-realizable case. A specific case we
analyze here is again the setting where the data instances are drawn uniformly from the unit ball
in $\mathbb{R}^d$ and a linear classifier $w^*$ is the Bayes classifier. We additionally assume that our data satisfies
the popular Tsybakov small noise condition along the decision boundary [200]. We consider both
a simple version which leads to an exponential improvement similar to item 1 above, and a setting
where we get only a polynomial improvement in the sample complexity, and where this is provably
the best we can do [80]. Our analysis for these specific cases significantly improves on the work
presented in Section 5.1 and the previous related work in [80].

Definitions and Notation: In this section, we consider learning linear classifiers, so $C$ is the class of
functions of the form $h(x) = \mathrm{sign}(w \cdot x)$. As in section 5.1, we assume that the data points $(x, y)$ are
drawn from an unknown underlying distribution $P$ over $X \times Y$ and we focus on the binary classification
case (i.e., $Y = \{-1, 1\}$). Our goal is to find a classifier $f$ with small true error, where
$\mathrm{err}(h) = \Pr_{(x,y) \sim P}[h(x) \neq y]$. We denote by $d(h, g)$ the probability that the two classifiers $h$ and $g$ predict differently
on an example coming at random from $P$. Furthermore, for $\alpha \in [0, 1]$ we denote by $B(h, \alpha)$ the set
$\{g \mid d(h, g) \le \alpha\}$. As in section 5.1 we let $D$ denote $P_X$.

In this section we focus on analyzing margin based active learning algorithms, in particular variants
of Algorithm 8. Specific choices for the learning algorithm $A$, sample sizes $m_k$ and cut-off values $b_k$
depend on various assumptions we will make about the data, which we will investigate in detail in
the following sections. We note that margin based active learning algorithms have been widely used in
practical applications (see e.g. [199]).

5.2.1  The  Realizable  Case  under  the  Uniform  Distribution 

We assume here that the data instances are drawn uniformly from the unit ball in $\mathbb{R}^d$ and that the
labels are consistent with a linear separator $w^*$ going through the origin (that is, $P(w^* \cdot x\, y \le 0) = 0$).
We assume that $\|w^*\|_2 = 1$. As mentioned in Section 5.1, even in this seemingly simple scenario,
there exists an $\Omega\left(\frac{1}{\epsilon}\left(d + \log\frac{1}{\delta}\right)\right)$ lower bound on the PAC learning sample complexity [165].

Before presenting our better bounds, we start by informally presenting how it is possible to get an $\tilde{O}(d^{3/2} \log(1/\epsilon))$
labeled sample complexity via a margin based active learning algorithm. (Note that the analysis for the
$A^2$ algorithm in Section 5.1.4 already implies a bound of $\tilde{O}(d^2 \log(1/\epsilon))$, and as we in fact argue below
that analysis can be improved to $\tilde{O}(d^{3/2} \log(1/\epsilon))$ in the realizable case. We make this clearer in the note




Algorithm 8 Margin-based Active Learning.

Input: unlabeled data set $S_U = \{x_1, x_2, \ldots\}$,
       a learning algorithm $A$ that learns a weight vector from labeled data,
       a sequence of sample sizes $0 < \tilde{m}_1 < \tilde{m}_2 < \cdots < \tilde{m}_s = \tilde{m}_{s+1}$,
       a sequence of cut-off values $b_k > 0$ ($k = 1, \ldots, s$)

Output: classifier $\hat{w}_s$

Label data points $x_1, \ldots, x_{\tilde{m}_1}$ using the oracle
iterate $k = 1, \ldots, s$
    use $A$ to learn weight vector $\hat{w}_k$ from the first $\tilde{m}_k$ labeled samples.
    for $j = \tilde{m}_k + 1, \ldots, \tilde{m}_{k+1}$
        if $|\hat{w}_k \cdot x_j| > b_k$ then let $y_j = \mathrm{sign}(\hat{w}_k \cdot x_j)$
        else label data point $x_j$ using the oracle
end iterate


after Theorem 5.2.1.) Let us consider Algorithm 8, where $A$ is a learning algorithm for finding a linear
classifier consistent with the training data. Assume that in each iteration $k$, $A$ finds a linear separator $\hat{w}_k$,
$\|\hat{w}_k\|_2 = 1$, which is consistent with the first $\tilde{m}_k$ labeled examples. We want to ensure that $\mathrm{err}(\hat{w}_k) \le 2^{-k}$
(with large probability), which (by standard VC bounds) requires a sample of size $\tilde{m}_k = \tilde{O}(2^k d)$; note
that this implies we need to add in each iteration about $m_k = \tilde{m}_{k+1} - \tilde{m}_k = \tilde{O}(2^k d)$ new labeled examples.
The desired result will follow if we can show that by choosing appropriate $b_k$, we only need to ask the
oracle to label $\tilde{O}(d^{3/2})$ out of the $m_k = \tilde{O}(2^k d)$ data points and ensure that all $m_k$ data points are
correctly labeled (i.e. the examples labeled automatically are in fact correctly labeled).

Note that given our assumption about the data distribution the error rate of any given separator $w$ is
$\mathrm{err}(w) = \theta(w, w^*)/\pi$, where $\theta(w, w^*) = \arccos(w \cdot w^*)$. Therefore $\mathrm{err}(\hat{w}_k) \le 2^{-k}$ implies that
$\|\hat{w}_k - w^*\|_2 \le 2^{-k}\pi$. This implies we can safely label all the points with $|\hat{w}_k \cdot x| \ge 2^{-k}\pi$ because $w^*$ and
$\hat{w}_k$ predict the same on those examples. The probability of $x$ such that $|\hat{w}_k \cdot x| \le 2^{-k}\pi$ is $\tilde{O}(2^{-k}\sqrt{d})$
because in high dimensions, the 1-dimensional projection of uniform random variables in the unit ball is
approximately a Gaussian variable with variance $1/d$. Therefore if we let $b_k = 2^{-k}\pi$ in the $k$-th iteration,
and draw $\tilde{m}_{k+1} - \tilde{m}_k = \tilde{O}(2^k d)$ new examples to achieve an error rate of $2^{-(k+1)}$ for $\hat{w}_{k+1}$, the expected
number of human labels needed is at most $\tilde{O}(d^{3/2})$. This essentially implies the desired result. For a high
probability statement, we can use Algorithm 9, which is a modification of Algorithm 8.
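The two facts used in this informal argument can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: the function names, the sample sizes, and the choice of the first coordinate axis as $w^*$ are illustrative assumptions. It estimates the disagreement probability of two separators at angle $\theta$ (expected to be close to $\theta/\pi$), estimates the mass of the band $|w \cdot x| \le b$ (compared in the proof below against $b\sqrt{4d/\pi}$), and multiplies out the heuristic label count per round with constants and log factors dropped.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Uniform point in the unit ball of R^d: Gaussian direction times radius U^(1/d)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * t / norm for t in g]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def disagreement_rate(theta, d, n, rng):
    """Estimate Pr[sign(w.x) != sign(w*.x)] for unit vectors w, w* at angle theta."""
    w_star = [1.0] + [0.0] * (d - 1)          # assumed target direction
    w = [math.cos(theta), math.sin(theta)] + [0.0] * (d - 2)
    bad = 0
    for _ in range(n):
        x = sample_unit_ball(d, rng)
        if dot(w, x) * dot(w_star, x) < 0:
            bad += 1
    return bad / n


def band_mass(b, d, n, rng):
    """Estimate Pr[|w.x| <= b] for a fixed unit vector w (the first axis)."""
    hits = 0
    for _ in range(n):
        x = sample_unit_ball(d, rng)
        if abs(x[0]) <= b:
            hits += 1
    return hits / n


def expected_queries(k, d):
    """Heuristic count: (2^k d draws) * (2^-k sqrt(d) band mass) = d^(3/2)."""
    return (2 ** k * d) * (2 ** -k * math.sqrt(d))
```

For example, `disagreement_rate(math.pi / 4, 5, 20000, random.Random(0))` should come out near 0.25, and `expected_queries(k, d)` does not depend on `k`, which is the source of the $d^{3/2}$ labels per round.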

Note that we can apply our favorite algorithm for finding a consistent linear separator (e.g., SVM for
the realizable case, linear programming, etc.) at each iteration of Algorithm 9, and the overall procedure
is computationally efficient.
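The label-requesting loop of Algorithm 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`collect_labels`, `demo`), the target direction, the current hypothesis, and the numeric parameters are all hypothetical, and the step that finds a consistent separator is deliberately left out (any consistent learner could be plugged in).

```python
import math
import random


def sample_unit_ball(d, rng):
    """Uniform point in the unit ball of R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * t / norm for t in g]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def collect_labels(w_k, b_k, m_next, draw, oracle):
    """Draw x from D until m_next labels were requested.

    A point with |w_k . x| > b_k is rejected without querying the oracle;
    only points inside the margin band are labeled and kept.
    """
    labeled, draws = [], 0
    while len(labeled) < m_next:
        x = draw()
        draws += 1
        if abs(dot(w_k, x)) > b_k:
            continue  # far from the current boundary: no label requested
        labeled.append((x, oracle(x)))
    return labeled, draws


def demo(seed=0, d=5, b_k=0.2, m_next=50):
    """Run one round with an assumed target w* and an assumed current hypothesis."""
    rng = random.Random(seed)
    w_star = [1.0] + [0.0] * (d - 1)
    w_k = [math.cos(0.1), math.sin(0.1)] + [0.0] * (d - 2)
    draw = lambda: sample_unit_ball(d, rng)
    oracle = lambda x: 1 if dot(w_star, x) >= 0 else -1
    return w_k, collect_labels(w_k, b_k, m_next, draw, oracle)
```

The returned `draws` counts unlabeled examples consumed, while `len(labeled)` counts oracle queries; the gap between the two is what the analysis exploits.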

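Theorem 5.2.1 There exists a constant $C$, such that for any $\epsilon, \delta > 0$, using Algorithm 9 with

$$b_k = \frac{\pi}{2^{k-1}} \quad \text{and} \quad m_k = C d^{1/2}\left(d \ln d + \ln\frac{k}{\delta}\right),$$

after $s = \lceil \log_2(1/\epsilon) \rceil$ iterations, we can efficiently find a separator of error at most $\epsilon$ with probability $1 - \delta$.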

Proof: The proof is essentially a more rigorous version of the informal one given earlier. We prove
by induction on $k$ that at the $k$-th iteration, with probability $1 - \delta(1 - 1/(k+1))$, we have $\mathrm{err}(\hat{w}) \le 2^{-k}$
for all $\hat{w}$ consistent with data in the set $W(k)$; in particular, $\mathrm{err}(\hat{w}_k) \le 2^{-k}$.




Algorithm 9 Margin-based Active Learning (separable case).

Input: allowed error rate $\epsilon$, probability of failure $\delta$, a sampling oracle for $D$, and a labeling oracle,
       a sequence of sample sizes $m_k > 0$, $k \in \mathbb{Z}^+$; a sequence of cut-off values $b_k > 0$, $k \in \mathbb{Z}^+$

Output: weight vector $\hat{w}_s$ of error at most $\epsilon$ with probability $1 - \delta$

Draw $m_1$ examples from $D$, label them and put into a working set $W(1)$.
iterate $k = 1, \ldots, s$
    find a hypothesis $\hat{w}_k$ ($\|\hat{w}_k\|_2 = 1$) consistent with all labeled examples in $W(k)$.
    let $W(k+1) = W(k)$.
    until $m_{k+1}$ additional data points are labeled, draw sample $x$ from $D$
        if $|\hat{w}_k \cdot x| > b_k$ then reject $x$
        else ask for label of $x$, and put into $W(k+1)$
end iterate


For $k = 1$, according to Theorem A.2.1 in Appendix A.2, we only need $m_1 = O(d + \ln(1/\delta))$
examples to obtain the desired result. In particular, we have $\mathrm{err}(\hat{w}_1) \le 1/2$ with probability $1 - \delta/2$.
Assume now the claim is true for $k - 1$. Then at the $k$-th iteration, we can let

$$S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\} \quad \text{and} \quad S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}.$$

Using the notation $\mathrm{err}(w|S) = \Pr_x((w \cdot x)(w^* \cdot x) < 0 \mid x \in S)$, for all $\hat{w}$ we have:

$$\mathrm{err}(\hat{w}) = \mathrm{err}(\hat{w}|S_1)\Pr(S_1) + \mathrm{err}(\hat{w}|S_2)\Pr(S_2).$$

Consider an arbitrary $\hat{w}$ consistent with the data in $W(k-1)$. By induction hypothesis, we know that with
probability at least $1 - \delta(1 - 1/k)$, both $\hat{w}_{k-1}$ and $\hat{w}$ have errors at most $2^{1-k}$ (because both are consistent
with $W(k-1)$). As discussed earlier, this implies that $\|\hat{w}_{k-1} - w^*\|_2 \le 2^{1-k}\pi$ and $\|\hat{w} - w^*\|_2 \le 2^{1-k}\pi$.
Therefore $\forall x \in S_2$, we have

$$(\hat{w}_{k-1} \cdot x)(\hat{w} \cdot x) > 0 \quad \text{and} \quad (\hat{w}_{k-1} \cdot x)(w^* \cdot x) > 0.$$

This implies that $\mathrm{err}(\hat{w}|S_2) = 0$. Now using the estimate provided in Lemma A.2.2 with $\gamma_1 = b_{k-1}$ and
$\gamma_2 = 0$, we obtain $\Pr_x(S_1) \le b_{k-1}\sqrt{4d/\pi}$. Therefore

$$\mathrm{err}(\hat{w}) \le 2^{2-k}\sqrt{4\pi d} \cdot \mathrm{err}(\hat{w}|S_1),$$

for all $\hat{w}$ consistent with $W(k-1)$. Now, since we are labeling $m_k$ data points in $S_1$ at iteration $k - 1$,
it follows from Theorem A.2.1 that we can find $C$ s.t. with probability $1 - \delta/(k^2 + k)$, for all $\hat{w}$
consistent with the data in $W(k)$, $\mathrm{err}(\hat{w}|S_1)$, the error of $\hat{w}$ on $S_1$, is no more than $1/(4\sqrt{4\pi d})$. That is,
$\mathrm{err}(\hat{w}) \le 2^{-k}$ with probability at least $1 - \delta((1 - 1/k) + 1/(k^2 + k)) = 1 - \delta(1 - 1/(k+1))$ for all $\hat{w}$
consistent with $W(k)$, and in particular $\mathrm{err}(\hat{w}_k) \le 2^{-k}$, as desired. ■
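The probability bookkeeping in the induction step is a union bound followed by a partial-fraction identity. Written out as a worked equation (our expansion of the step, not taken verbatim from the source):

```latex
\begin{align*}
\Pr[\text{failure by round } k]
  &\le \delta\Bigl(1 - \frac{1}{k}\Bigr) + \frac{\delta}{k^2 + k}
   && \text{(union bound over rounds } 1,\dots,k-1 \text{ and round } k\text{)} \\
  &= \delta\Bigl(1 - \frac{1}{k} + \frac{1}{k} - \frac{1}{k+1}\Bigr)
   && \text{since } \frac{1}{k(k+1)} = \frac{1}{k} - \frac{1}{k+1} \\
  &= \delta\Bigl(1 - \frac{1}{k+1}\Bigr) \;<\; \delta .
\end{align*}
```

The per-round failure budget $\delta/(k^2+k)$ is chosen precisely so that the sum telescopes and the total failure probability never exceeds $\delta$, no matter how many rounds are run.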

The choice of rejection region in Theorem 5.2.1 essentially follows the “sampling from the region
of disagreement” idea introduced in [86] for the realizable case. As mentioned in Section 5.1, [86]
suggested that one should not sample from a region ($S_2$ in the proof) in which all classifiers in the current
version space (in our case, classifiers consistent with the labeled examples in $W(k)$) predict the same
label. In Section 5.1 and in [30] we have analyzed a more general version of the strategy proposed in [86]
that is correct in the much more difficult agnostic case and we have provided theoretical analysis. Here we
have used a more refined VC-bound for the realizable case, e.g., Theorem A.2.1, to get a better bound.
However, the strategy of choosing $b_k$ in Theorem 5.2.1 (thus the idea of [86]) is not optimal. This can be
seen from the proof, in which we showed $\mathrm{err}(\hat{w}_s|S_2) = 0$. If we enlarge $S_2$ (using a smaller $b_k$), we can
still ensure that $\mathrm{err}(\hat{w}_s|S_2)$ is small; furthermore, $\Pr(S_1)$ becomes smaller, which allows us to use fewer
labeled examples to achieve the same reduction in error. Therefore in order to show that we can achieve an
improvement from $\tilde{O}(d^{3/2}\log(1/\epsilon))$ to $\tilde{O}(d\log(1/\epsilon))$ as in [98], we need a more aggressive strategy. Specifically, at
round $k$ we set as margin parameter $b_k = O\left(2^{-k}\sqrt{\ln(1+k)/d}\right)$, and in consequence use fewer examples to transition
between rounds. In order to prove correctness we need to refine the analysis as follows:

Theorem 5.2.2 There exists a constant $C$ such that for $d \ge 4$, and for any $\epsilon, \delta > 0$, $\epsilon < 1/4$, using
Algorithm 9 with

$$m_k = C\sqrt{\ln(1+k)}\left(d\ln(1 + \ln k) + \ln\frac{k}{\delta}\right) \quad \text{and} \quad b_k = 2^{1-k}\pi d^{-1/2}\sqrt{5 + \ln(1+k)},$$

after $s = \lceil \log_2(1/\epsilon) \rceil - 2$ iterations, we efficiently find a separator of error $\le \epsilon$ with probability at least $1 - \delta$.

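Proof: As in Theorem 5.2.1, we prove by induction on $k$ that at the $k$-th iteration, for $k \le s$, with probability
at least $1 - \delta(1 - 1/(k+1))$, we have $\mathrm{err}(\hat{w}) \le 2^{-k-2}$ for all choices of $\hat{w}$ consistent with data in the working
set $W(k)$; in particular $\mathrm{err}(\hat{w}_k) \le 2^{-k-2}$.

For $k = 1$, according to Theorem A.2.1, we only need $m_k = O(d + \ln(1/\delta))$ examples to obtain the
desired result; in particular, we have $\mathrm{err}(\hat{w}_1) \le 2^{-k-2}$ with probability $1 - \delta/(k+1)$. Assume now the
claim is true for $k - 1$ ($k > 1$). Then at the $k$-th iteration, we can let

$$S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\} \quad \text{and} \quad S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}.$$

Consider an arbitrary $\hat{w}$ consistent with the data in $W(k-1)$. By induction hypothesis, we know that
with probability $1 - \delta(1 - 1/k)$, both $\hat{w}_{k-1}$ and $\hat{w}$ have errors at most $2^{-k-1}$, implying that

$$\theta(\hat{w}_{k-1}, w^*) \le 2^{-k-1}\pi \quad \text{and} \quad \theta(\hat{w}, w^*) \le 2^{-k-1}\pi.$$

Therefore $\theta(\hat{w}, \hat{w}_{k-1}) \le 2^{-k}\pi$. Let $\tilde{\beta} = 2^{-k}\pi$ and using $\cos\tilde{\beta}/\sin\tilde{\beta} \le 1/\tilde{\beta}$ and $\sin\tilde{\beta} \le \tilde{\beta}$ it is easy to
verify that the following inequality holds:

$$b_{k-1} \ge 2\sin\tilde{\beta}\, d^{-1/2}\sqrt{5 + \ln\left(1 + \sqrt{\ln\max(1, \cos\tilde{\beta}/\sin\tilde{\beta})}\right)}.$$

By Lemma A.2.5, we have both

$$\Pr\left[(\hat{w}_{k-1} \cdot x)(\hat{w} \cdot x) < 0,\ x \in S_2\right] \le \frac{\sin\tilde{\beta}}{e^5\cos\tilde{\beta}} \le \frac{\sqrt{2}\tilde{\beta}}{e^5}$$

and

$$\Pr\left[(\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in S_2\right] \le \frac{\sin\tilde{\beta}}{e^5\cos\tilde{\beta}} \le \frac{\sqrt{2}\tilde{\beta}}{e^5}.$$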



Taking the sum, we obtain

$$\Pr\left[(\hat{w} \cdot x)(w^* \cdot x) < 0,\ x \in S_2\right] \le \frac{2\sqrt{2}\tilde{\beta}}{e^5} \le 2^{-(k+3)}.$$

Using now Lemma A.2.2 we get that for all $\hat{w}$ consistent with the data in $W(k-1)$ we have:

$$\mathrm{err}(\hat{w}) \le \mathrm{err}(\hat{w}|S_1)\Pr(S_1) + 2^{-(k+3)} \le \mathrm{err}(\hat{w}|S_1)\, b_{k-1}\sqrt{4d/\pi} + 2^{-(k+3)}$$
$$\le 2^{-(k+2)}\left(\mathrm{err}(\hat{w}|S_1)\, 16\sqrt{4\pi}\sqrt{5 + \ln(1+k)} + 1/2\right).$$

Since we are labeling $m_k$ points in $S_1$ at iteration $k - 1$, we know from Theorem A.2.1 in Appendix A.2
that $\exists C$ s.t. with probability $1 - \delta/(k + k^2)$ we have

$$\mathrm{err}(\hat{w}|S_1)\, 16\sqrt{4\pi}\sqrt{5 + \ln(1+k)} \le 0.5$$

for all $\hat{w}$ consistent with $W(k)$; so, with probability $1 - \delta((1 - 1/k) + 1/(k + k^2)) = 1 - \delta(1 - 1/(k+1))$,
we have $\mathrm{err}(\hat{w}) \le 2^{-k-2}$ for all $\hat{w}$ consistent with $W(k)$. ■

The bound in Theorem 5.2.2 is generally better than the one in Theorem 5.2.1 due to the improved
dependency on $d$ in $m_k$. However, $m_k$ depends on $\sqrt{\ln k}\,\ln\ln k$, for $k \le \lceil \log_2(1/\epsilon) \rceil - 2$. Therefore when
$d \ll \ln k\,(\ln\ln k)^2$, Theorem 5.2.1 offers a better bound. Note that the strategy used in Theorem 5.2.2 is
more aggressive than the strategy used in the selective sampling algorithm of [30, 86]. Indeed, we do not
sample from the entire region of uncertainty, but only from a carefully chosen subregion. This
helps us to get rid of the undesired $d^{1/2}$ factor. Our analysis also holds with very small modifications when the
input distribution comes from a high dimensional Gaussian.
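To make the comparison concrete, the per-round schedules of the two theorems can be evaluated side by side. The sketch below is only an illustration: the theorems assert that some constant $C$ exists without giving its value, so `C = 1.0` is an assumption, and the function names are ours.

```python
import math

C = 1.0  # hypothetical constant; the theorems only state that a suitable C exists


def m_k_thm_521(k, d, delta):
    """Per-round labels in Theorem 5.2.1: C * sqrt(d) * (d ln d + ln(k/delta))."""
    return C * math.sqrt(d) * (d * math.log(d) + math.log(k / delta))


def m_k_thm_522(k, d, delta):
    """Per-round labels in Theorem 5.2.2: C * sqrt(ln(1+k)) * (d ln(1+ln k) + ln(k/delta))."""
    return C * math.sqrt(math.log(1 + k)) * (
        d * math.log(1 + math.log(k)) + math.log(k / delta)
    )


def b_k_thm_521(k):
    """Cut-off in Theorem 5.2.1: pi / 2^(k-1)."""
    return math.pi / 2 ** (k - 1)


def b_k_thm_522(k, d):
    """Cut-off in Theorem 5.2.2: 2^(1-k) * pi * d^(-1/2) * sqrt(5 + ln(1+k))."""
    return 2 ** (1 - k) * math.pi * math.sqrt(5 + math.log(1 + k)) / math.sqrt(d)


def rounds_thm_521(eps):
    """Number of rounds s = ceil(log2(1/eps)) used in Theorem 5.2.1."""
    return math.ceil(math.log2(1 / eps))
```

With `d = 100`, `k = 5` and `delta = 0.05`, the second schedule asks for far fewer labels per round and uses a narrower band, which reflects the removal of the $d^{1/2}$ factor; the ordering can flip only when $d$ is tiny relative to $\ln k\,(\ln\ln k)^2$.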


5.2.2  The  Non-realizable  Case  under  the  Uniform  Distribution

We show that a result similar to Theorem 5.2.2 can be obtained even for non-separable problems under a
specific type of noise, although not necessarily in a computationally efficient manner. The non-realizable
(noisy) case for active learning in the context of classification was recently explored in [80] and, as we
have seen in Section 5.1, in [30, 35] as well. We consider here a model which is related to the simple
one-dimensional problem in [80], which assumes that the data satisfy the increasingly popular Tsybakov small
noise condition along the decision boundary [200]. We first consider a simple version which still leads to
exponential convergence similar to Theorem 5.2.2. Specifically, we still assume that the data instances are
drawn uniformly from the unit ball in $\mathbb{R}^d$ and a linear classifier $w^*$ is the Bayes classifier. However,
we do not assume that the Bayes error is zero. We consider the following low noise condition: there exists
a known parameter $\beta > 0$ such that:

$$P_X\left(|P(y = 1|x) - P(y = -1|x)| \ge 4\beta\right) = 1.$$

It is known that in the passive supervised learning setting this condition can lead to fast convergence
rates. As we will show in this section, the condition can also be used to quantify the effectiveness of
active learning. The key point is that this assumption implies the stability condition required for active
learning:

$$\beta\min\left(1, \frac{4\theta(w, w^*)}{\pi}\right)^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*) \qquad (5.2.1)$$


Algorithm 10 Margin-based Active Learning (non-separable case).

Input: allowed error rate $\epsilon$, probability of failure $\delta$, a sampling oracle for $D$, and a labeling oracle,
       a sequence of sample sizes $m_k > 0$, $k \in \mathbb{Z}^+$; a sequence of cut-off values $b_k > 0$, $k \in \mathbb{Z}^+$,
       a sequence of hypothesis space radii $r_k > 0$, $k \in \mathbb{Z}^+$;
       a sequence of precision values $\epsilon_k > 0$, $k \in \mathbb{Z}^+$

Output: weight vector $\hat{w}_s$ of excess error at most $\epsilon$ with probability $1 - \delta$

Pick random $\hat{w}_0$: $\|\hat{w}_0\|_2 = 1$.
Draw $m_1$ examples from $D$, label them and put into a working set $W$.
iterate $k = 1, \ldots, s$
    find $\hat{w}_k \in B(\hat{w}_{k-1}, r_k)$ ($\|\hat{w}_k\|_2 = 1$) to approximately minimize training error:
        $\sum_{(x,y) \in W} I(\hat{w}_k \cdot x y) \le \min_{w \in B(\hat{w}_{k-1}, r_k)} \sum_{(x,y) \in W} I(w \cdot x y) + m_k\epsilon_k$.
    clear the working set $W$
    until $m_{k+1}$ additional data points are labeled, draw sample $x$ from $D$
        if $|\hat{w}_k \cdot x| > b_k$ then reject $x$
        else ask for label of $x$, and put into $W$
end iterate


with $\alpha = 0$. We analyze here a more general setting with $\alpha \in [0, 1)$. As mentioned already, the one
dimensional setting was examined in [80]. We call $\mathrm{err}(w) - \mathrm{err}(w^*)$ the excess error of $w$. In this setting,
Algorithm 9 needs to be slightly modified, as in Algorithm 10.

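Theorem 5.2.3 Let $d \ge 4$. Assume there exists a weight vector $w^*$ s.t. the stability condition (5.2.1) holds.
Then there exists a constant $C$, s.t. for any $\epsilon, \delta > 0$, $\epsilon < \beta/8$, using Algorithm 10 with

$$b_k = 2^{-(1-\alpha)k}\pi d^{-1/2}\sqrt{5 + \alpha k\ln 2 - \ln\beta + \ln(2+k)},$$

$$r_k = 2^{-(1-\alpha)k-2}\pi \ \text{for } k > 1, \quad r_1 = \pi,$$

$$\epsilon_k = \frac{2^{-\alpha(k-1)-4}\beta}{\sqrt{5 + \alpha k\ln 2 - \ln\beta + \ln(1+k)}} \quad \text{and}$$

$$m_k = C\epsilon_k^{-2}\left(d + \ln\frac{k}{\delta}\right),$$

after $s = \lceil \log_2(\beta/\epsilon) \rceil$ iterations, we find a separator with excess error $\le \epsilon$ with probability $1 - \delta$.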

Proof: The proof is similar to that of Theorem 5.2.2. We prove by induction on $k$ that after $k \le s$
iterations, $\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le 2^{-k}\beta$ with probability $1 - \delta(1 - 1/(k+1))$.

For $k = 1$, according to Theorem A.1.1, we only need $m_k = \beta^{-2}O(d + \ln(k/\delta))$ examples to obtain
$\hat{w}_1$ with excess error $2^{-k}\beta$ with probability $1 - \delta/(k+1)$. Assume now the claim is true for $k - 1$ ($k \ge 2$).
Then at the $k$-th iteration, we can let

$$S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\} \quad \text{and} \quad S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}.$$

By induction hypothesis, we know that with probability at least $1 - \delta(1 - 1/k)$, $\hat{w}_{k-1}$ has excess error at
most $2^{-k+1}\beta$, implying $\theta(\hat{w}_{k-1}, w^*) \le 2^{-(1-\alpha)(k-1)}\pi/4$. By assumption, $\theta(\hat{w}_{k-1}, \hat{w}_k) \le 2^{-(1-\alpha)k-2}\pi$.




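Let $\tilde{\beta} = 2^{-(1-\alpha)k-2}\pi$ and using $\cos\tilde{\beta}/\sin\tilde{\beta} \le 1/\tilde{\beta}$ and $\sin\tilde{\beta} \le \tilde{\beta}$, it is easy to verify that the following
inequality holds:

$$b_{k-1} \ge 2\sin\tilde{\beta}\, d^{-1/2}\sqrt{5 + \alpha k\ln 2 - \ln\beta + \ln\left(1 + \sqrt{\ln(\cos\tilde{\beta}/\sin\tilde{\beta})}\right)}.$$

From Lemma A.2.5, we have both

$$\Pr\left[(\hat{w}_{k-1} \cdot x)(\hat{w}_k \cdot x) < 0,\ x \in S_2\right] \le \frac{\sin\tilde{\beta}}{e^5\beta^{-1}2^{\alpha k}\cos\tilde{\beta}} \le \frac{\sqrt{2}\tilde{\beta}\beta}{2^{\alpha k}e^5}$$

and

$$\Pr\left[(\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in S_2\right] \le \frac{\sin\tilde{\beta}}{e^5\beta^{-1}2^{\alpha k}\cos\tilde{\beta}} \le \frac{\sqrt{2}\tilde{\beta}\beta}{2^{\alpha k}e^5}.$$

Taking the sum, we obtain

$$\Pr\left[(\hat{w}_k \cdot x)(w^* \cdot x) < 0,\ x \in S_2\right] \le \frac{2\sqrt{2}\tilde{\beta}\beta}{2^{\alpha k}e^5} \le 2^{-(k+1)}\beta.$$

Therefore we have (using Lemma A.2.2):

$$\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le \left(\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1)\right)\Pr(S_1) + 2^{-(k+1)}\beta$$
$$\le \left(\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1)\right)b_{k-1}\sqrt{4d/\pi} + 2^{-(k+1)}\beta$$
$$\le 2^{-k}\beta\left(\left(\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1)\right)\sqrt{\pi}/(4\epsilon_k) + 1/2\right).$$

From Theorem A.2.1, we know we can choose $C$ s.t. with $m_k$ samples, we obtain

$$\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1) \le 2\epsilon_k/\sqrt{\pi}$$

with probability $1 - \delta/(k + k^2)$. Therefore $\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le 2^{-k}\beta$ with probability
$1 - \delta((1 - 1/k) + 1/(k + k^2)) = 1 - \delta(1 - 1/(k+1))$. ■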

If $\alpha = 0$, then we can achieve exponential convergence similar to Theorem 5.2.2, even for noisy
problems. However, for $\alpha \in (0, 1)$, we have to label $\sum_k m_k = O(\epsilon^{-2\alpha}\ln(1/\epsilon)(d + \ln(s/\delta)))$ examples
to achieve an error rate of $\epsilon$. That is, we only get a polynomial improvement compared to the batch
learning case (with sample complexity between $O(\epsilon^{-2})$ and $O(\epsilon^{-1})$). In general, one cannot improve
such polynomial behavior; see [80] for some simple one-dimensional examples.

Note: The bounds here improve significantly over the previous work in [30, 80]. [80] studies a similar
model to ours, but for the much simpler one dimensional case. The model studied in [30] and also
considered in Section 5.1 is more general: it applies to the purely agnostic setting, and the algorithm itself
works generically for any concept space; however, for the specific case of learning linear separators the
bounds end up having a worse quadratic rather than linear dependence on $d$.

Note: Instead of rejecting $x$ when $|\hat{w}_k \cdot x| > b_k$, we can add them to $W$ using the automatic labels from
$\hat{w}_k$. We can then remove the requirement $\hat{w}_k \in B(\hat{w}_{k-1}, r_k)$ (thus removing the parameters $r_k$). The
resulting procedure will have the same convergence behavior as Theorem 5.2.3 because the probability of
making error by $\hat{w}_k$ when $|\hat{w}_k \cdot x| > b_k$ is no more than $2^{-(k+2)}\beta$.




Other Results on Margin Based Active Learning In [33] we also give an analysis of our algorithm
for a case where we have a “good margin distribution”, and we show how active learning can dramatically
improve (the supervised learning) sample complexity in that setting as well; the bounds we obtain for
that do not depend on the dimensionality $d$. We also provide a generic analysis of our main algorithm,
Algorithm 8.

5.2.3  Discussion 

We have shown here that more aggressive active learning strategies can generally lead to better bounds.
Note however that the analysis in this section (based on [33]) was specific to the realizable case, or done
for a special type of noise. It is an open question to design aggressive agnostic active learning algorithms.

While our algorithm is computationally efficient in the realizable case, it remains an open problem
to make it efficient in the general case. It is conceivable that for some special cases (e.g. the marginal
distribution over the instance space is uniform, as in section 5.2.2) one could use the recent results of
Kalai et al. for Agnostically Learning Halfspaces [145]. In fact, it would be interesting to derive precise
bounds (both in the realizable and the non-realizable cases) for the more general class of log-concave
distributions.


5.3  Other  Results  in  Active  Learning 

In recent work, we also show that in an asymptotic model for active learning where one bounds the number
of queries the algorithm makes before it finds a good function (i.e. one of arbitrarily small error rate), but
not the number of queries before it knows it has found a good function, one can obtain significantly better
bounds on the number of label queries required to learn than in the traditional active learning models.
These results appear in [34, 41].

Specifically, in [34, 41] we point out that traditional analyses [94] have studied the number of label
requests required before an algorithm can both produce an $\epsilon$-good classifier and prove that the classifier’s
error is no more than $\epsilon$. These studies have turned up simple examples where this number is no smaller than
the number of random labeled examples required for passive learning. This is the case for learning certain
nonhomogeneous linear separators and intervals on the real line, and generally seems to be a common
problem for many learning scenarios. As such, it has led some to conclude that active learning does not
help for most learning problems. In our work [34, 41] we dispel this misconception. Specifically, we
study the number of labels an algorithm needs to request before it can produce an $\epsilon$-good classifier, even
if there is no accessible confidence bound available to verify the quality of the classifier. With this type
of analysis, we prove that active learning can essentially always achieve asymptotically superior sample
complexity compared to passive learning when the VC dimension is finite. Furthermore, we find that for
most natural learning problems, including the negative examples given in the previous literature, active
learning can achieve exponential improvements over passive learning with respect to dependence on $\epsilon$.
Full details of the model and results can be found in [34, 41].




Chapter  6 

Kernels,  Margins,  and  Random  Projections 


In this chapter we return to study learning with kernel functions. As discussed in Chapter 3, a kernel
is a function that takes in two data objects (which could be images, DNA sequences, or points in $\mathbb{R}^n$)
and outputs a number, with the property that the function is symmetric and positive-semidefinite. That
is, for any kernel $K$, there must exist an (implicit) mapping $\phi$, such that for all inputs $x, x'$ we have
$K(x, x') = \phi(x) \cdot \phi(x')$. The kernel is then used inside a “kernelized” learning algorithm such as SVM
or kernel-perceptron as the way in which the algorithm interacts with the data. Furthermore, even though
$\phi$ may be a mapping into a very high-dimensional space, these algorithms have convergence rates that
depend only on the margin $\gamma$ of the best separator, and not on the dimension of the $\phi$-space [18, 191].
Thus, kernel functions are often viewed as providing much of the power of this implicit high-dimensional
space, without paying for it computationally (because the $\phi$ mapping is only implicit) or in terms of sample
size (if data is indeed well-separated in that space).
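As a small concrete instance of the identity $K(x, x') = \phi(x) \cdot \phi(x')$, consider the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$. This example is ours, chosen only because its feature map is small enough to write down; the function names are illustrative.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on R^2: K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """One explicit feature map for this kernel, into R^3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel_matches_feature_map(x, y, tol=1e-9):
    """Check K(x, y) = phi(x) . phi(y) up to floating-point error."""
    return abs(poly2_kernel(x, y) - dot(phi(x), phi(y))) < tol
```

Here the kernel evaluation costs one inner product in $\mathbb{R}^2$, while the explicit route goes through $\mathbb{R}^3$; for higher degrees and dimensions the explicit space grows quickly, which is why the mapping is normally left implicit.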

In this chapter, we point out that the Johnson-Lindenstrauss [96] lemma suggests that in the presence
of a large margin, a kernel function can also be viewed as a mapping to a low-dimensional space, one of
dimension only $O(1/\gamma^2)$. We then explore the question of whether one can efficiently produce such
low-dimensional mappings, using only black-box access to a kernel function. That is, given just a program
that computes $K(x, y)$ on inputs $x, y$ of our choosing, can we efficiently construct an explicit (small)
set of features that effectively capture the power of the implicit high-dimensional space? We answer
this question in the affirmative if our method is also allowed black-box access to the underlying data
distribution (i.e., unlabeled examples). We also give a lower bound, showing that if we do not have access
to the distribution, then this is not possible for an arbitrary black-box kernel function.

Our positive result can be viewed as saying that designing a good kernel function is much like
designing a good feature space. Given a kernel, by running it in a black-box manner on random unlabeled
examples, we can efficiently generate an explicit set of $O(1/\gamma^2)$ features, such that if the data was linearly
separable with margin $\gamma$ under the kernel, then it is approximately separable in this new feature space.


6.1  Introduction 

The starting point for this chapter is the observation that if a learning problem indeed has the large margin
property under some kernel $K(x,y) = \phi(x) \cdot \phi(y)$, then by the Johnson-Lindenstrauss lemma, a random
linear projection of the “$\phi$-space” down to a low-dimensional space approximately preserves linear
separability [7, 21, 96, 142]. Specifically, suppose data comes from some underlying distribution $D$ over
the input space $X$ and is labeled by some target function $c$. If $D$ is such that the target function has
margin $\gamma$ in the $\phi$-space,¹ then a random linear projection of the $\phi$-space down to a space of dimension
$d = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$ will, with probability at least $1 - \delta$, have a linear separator with error rate at most $\epsilon$ (see
Arriaga and Vempala [21] and also Theorem 6.4.2 in this chapter). This means that for any kernel $K$ and
margin $\gamma$, we can, in principle, think of $K$ as mapping the input space $X$ into an $\tilde{O}(1/\gamma^2)$-dimensional
space, in essence serving as a method for representing the data in a new (and not too large) feature space.

The question we consider in this chapter is whether, given kernel $K$, we can in fact produce such
a mapping efficiently. The problem with the above observation is that it requires explicitly computing
the function $\phi(x)$. In particular, the mapping of $X$ into $\mathbb{R}^d$ that results from applying the
Johnson-Lindenstrauss lemma is a function $F(x) = (r_1 \cdot \phi(x), \ldots, r_d \cdot \phi(x))$, where $r_1, \ldots, r_d$ are random vectors
in the $\phi$-space. Since for a given kernel $K$, the dimensionality of the $\phi$-space might be quite large, this is
not efficient. Instead, what we would like is an efficient procedure that, given $K(\cdot,\cdot)$ as a black-box
program, produces a mapping with the desired properties and with running time that depends (polynomially)
only on $1/\gamma$ and the time to compute the kernel function $K$, with no dependence on the dimensionality of
the $\phi$-space.

Our main result is a positive answer to this question, if our procedure for computing the mapping
is also given black-box access to the distribution $D$ (i.e., unlabeled data). Specifically, given black-box
access to a kernel function $K(x,y)$, a margin value $\gamma$, access to unlabeled examples from distribution $D$,
and parameters $\epsilon$ and $\delta$, we can in polynomial time construct a mapping $F : X \to \mathbb{R}^d$ (i.e., to a set of $d$
real-valued features) where $d = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$ with the following property. If the target concept indeed
has margin $\gamma$ in the $\phi$-space, then with probability $1 - \delta$ (over randomization in our choice of mapping
function), the induced distribution in $\mathbb{R}^d$ is separable with error at most $\epsilon$. In fact, not only will the data in
$\mathbb{R}^d$ be separable, but it will be separable with margin $\Omega(\gamma)$. Note that the logarithmic dependence on $\epsilon$
implies that if the learning problem has a perfect separator of margin $\gamma$ in the $\phi$-space, we can set $\epsilon$ small
enough so that with high probability a set $S$ of $O(d \log d)$ labeled examples would be perfectly separable
in the mapped space. This means we could apply an arbitrary zero-noise linear-separator learning
algorithm in the mapped space, such as a highly-optimized linear-programming package. However, while the
dimension $d$ has a logarithmic dependence on $1/\epsilon$, the number of (unlabeled) examples we use to produce
our mapping is $\tilde{O}(1/(\gamma^2 \epsilon))$.

To give a feel of what such a mapping might look like, suppose we are willing to use dimension
$d = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$ (so this is linear in $1/\epsilon$ rather than logarithmic) and we are not concerned with
preserving margins and only want approximate separability. Then we show the following simple
procedure suffices. Just draw a random sample of $d$ unlabeled points $x_1, \ldots, x_d$ from $D$ and define
$F(x) = (K(x, x_1), \ldots, K(x, x_d))$. That is, if we think of $K$ not so much as an implicit mapping into a
high-dimensional space but just as a similarity function over examples, what we are doing is drawing $d$
“reference” points and then defining the $i$th feature of $x$ to be its similarity with reference point $i$. We show
(Corollary 6.3.2) that under the assumption that the target function has margin $\gamma$ in the $\phi$-space, with
high probability the data will be approximately separable under this mapping. Thus, this gives a
particularly simple way of using the kernel and unlabeled data for feature generation, and in fact this was the
motivation for the work presented in Chapter 3.

Given the above results, a natural question is whether it might be possible to perform mappings of this
type without access to the underlying distribution. In Section 6.5 we show that this is in general not
possible, given only black-box access (and polynomially-many queries) to an arbitrary kernel $K$. However,
it may well be possible for specific standard kernels such as the polynomial kernel or the Gaussian kernel.

¹ That is, there exists a linear separator in the $\phi$-space such that any example from $D$ is correctly classified by margin $\gamma$. See
Section 6.2 for formal definitions. In Section 6.4.1 we consider the more general case that only a $1 - \alpha$ fraction of the distribution
$D$ is separated by margin $\gamma$.




Relation to Support Vector Machines and Margin Bounds: Given a set $S$ of $n$ training examples, the
kernel matrix defined over $S$ can be viewed as placing $S$ into an $n$-dimensional space, and the
weight-vector found by an SVM will lie in this space and maximize the margin with respect to the training data.
Our goal is to define a mapping over the entire distribution, with guarantees with respect to the distribution
itself. In addition, the construction of our mapping requires only unlabeled examples, and so could be
performed before seeing any labeled training data if unlabeled examples are freely available. There is,
however, a close relation to margin bounds [44, 191] for SVMs (see the remark after the statement of
Lemma 6.3.1 in Section 6.3), though the dimension of our output space is lower than that produced by
combining SVMs with standard margin bounds.

Our goals are to some extent related to those of Ben-David et al. [48, 50]. They show negative results
giving simple classes of learning problems for which one cannot construct a mapping to a low-dimensional
space under which all functions in the class are linearly separable. We restrict ourselves to situations where
we know that such mappings exist, but our goal is to produce them efficiently.

Interpretation: Kernel functions are often viewed as providing much of the power of an implicit
high-dimensional space without having to pay for it. Our results suggest that an alternative view of kernels is
as a (distribution-dependent) mapping into a low-dimensional space. In this view, designing a good kernel
function is much like designing a good feature space. Given a kernel, by running it in a black-box manner
on random unlabeled examples, one can efficiently generate an explicit set of $\tilde{O}(1/\gamma^2)$ features, such that
if the data was linearly separable with margin $\gamma$ under the kernel, then it is approximately separable using
these new features.

Outline of this chapter: We begin by giving our formal model and definitions in Section 6.2. We
then show in Section 6.3 that the simple mapping described earlier in this section preserves approximate
separability, and give a modification that approximately preserves both separability and margin. Both of
these map data into a $d$-dimensional space for $d = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$. In Section 6.4, we give an improved
mapping that maps data to a space of dimension only $O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$. This logarithmic dependence on $\frac{1}{\epsilon}$
means we can set $\epsilon$ small enough, as a function of the dimension and our input error parameter, that we
can then plug in a generic zero-noise linear separator algorithm in the mapped space (assuming the target
function was perfectly separable with margin $\gamma$ in the $\phi$-space). In Section 6.5 we give a lower bound,
showing that for a black-box kernel, one must have access to the underlying distribution $D$ if one wishes
to produce a good mapping into a low-dimensional space.

6.2  Notation  and  Definitions 

We briefly introduce here the notation needed throughout the chapter. We assume that data is drawn from
some distribution $D$ over an instance space $X$ and labeled by some unknown target function
$c : X \to \{-1, +1\}$. We use $P$ to denote the combined distribution over labeled examples.

A kernel $K$ is a pairwise function $K(x, y)$ that can be viewed as a “legal” definition of inner product.
Specifically, there must exist a function $\phi$ mapping $X$ into a possibly high-dimensional Euclidean space
such that $K(x, y) = \phi(x) \cdot \phi(y)$. We call the range of $\phi$ the “$\phi$-space”, and use $\phi(D)$ to denote the induced
distribution in the $\phi$-space produced by choosing random $x$ from $D$ and then applying $\phi(x)$.

For simplicity we focus on the 0-1 loss for most of this chapter. We say that for a set $S$ of labeled
examples, a vector $w$ in the $\phi$-space has margin $\gamma$ if:

$$\min_{(x,l) \in S} \left[ l \, \frac{w \cdot \phi(x)}{\|w\| \, \|\phi(x)\|} \right] \ge \gamma.$$

That is, $w$ has margin $\gamma$ if any labeled example in $S$ is correctly classified by the linear separator $w \cdot \phi(x) \ge 0$,
and furthermore the cosine of the angle between $w$ and $\phi(x)$ has magnitude at least $\gamma$.² If such a vector
$w$ exists, then we say that $S$ is linearly separable with margin $\gamma$ under the kernel $K$. For simplicity, we are
only considering separators that pass through the origin, though our results can be adapted to the general
case as well (see Section 6.4.1).
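To make the sample-margin definition concrete, here is a minimal sketch that evaluates it on a finite labeled sample. It assumes the $\phi$-images are available explicitly as rows of a NumPy array, which is exactly what a kernel normally avoids; the function name `sample_margin` and the toy data are hypothetical and only illustrate the formula.

```python
import numpy as np

def sample_margin(w, phi_points, labels):
    """Return min over the sample of l * (w . phi(x)) / (||w|| ||phi(x)||)."""
    w = np.asarray(w, dtype=float)
    phi_points = np.asarray(phi_points, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Cosine of the angle between w and each phi(x).
    cosines = (phi_points @ w) / (
        np.linalg.norm(w) * np.linalg.norm(phi_points, axis=1)
    )
    # A negative value means some example is misclassified by w.
    return float(np.min(labels * cosines))

# Toy sample in the plane, separated by the horizontal axis.
phi_points = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
labels = np.array([1, 1, -1, -1])
w = np.array([0.0, 1.0])
gamma = sample_margin(w, phi_points, labels)
```

Because the definition is stated in terms of angles, rescaling `w` or any of the points leaves `gamma` unchanged.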

We can similarly talk in terms of the distribution $P$ rather than a sample $S$. We say that a vector $w$ in
the $\phi$-space has margin $\gamma$ with respect to $P$ if:

$$\Pr_{(x,l) \sim P} \left[ l \, \frac{w \cdot \phi(x)}{\|w\| \, \|\phi(x)\|} < \gamma \right] = 0.$$

If such a vector $w$ exists, then we say that $P$ is linearly separable with margin $\gamma$ under $K$ (or just that $P$
has margin $\gamma$ in the $\phi$-space). One can also weaken the notion of perfect separability. We say that a vector
$w$ in the $\phi$-space has error $\alpha$ at margin $\gamma$ if:

$$\Pr_{(x,l) \sim P} \left[ l \, \frac{w \cdot \phi(x)}{\|w\| \, \|\phi(x)\|} < \gamma \right] \le \alpha.$$

Our starting assumption in this chapter will be that $P$ is perfectly separable with margin $\gamma$ under $K$,
but we can also weaken the assumption to the existence of a vector $w$ with error $\alpha$ at margin $\gamma$, with a
corresponding weakening of the implications (see Section 6.4.1). Our goal is a mapping $F : X \to \mathbb{R}^d$,
where $d$ is not too large, that approximately preserves separability, and, ideally, the margin. We use $F(D)$
to denote the induced distribution in $\mathbb{R}^d$ produced by selecting points in $X$ from $D$ and then applying $F$,
and use $F(P) = F(D, c)$ to denote the induced distribution on labeled examples.

For a set of vectors $v_1, v_2, \ldots, v_k$ in Euclidean space, let span$(v_1, \ldots, v_k)$ denote the set of vectors $v$
that can be written as a linear combination $a_1 v_1 + \ldots + a_k v_k$. Also, for a vector $v$ and a subspace $Y$, let
proj$(v, Y)$ be the orthogonal projection of $v$ down to $Y$. So, for instance, proj$(v, \text{span}(v_1, \ldots, v_k))$ is the
orthogonal projection of $v$ down to the space spanned by $v_1, \ldots, v_k$. We note that given a set of vectors
$v_1, \ldots, v_k$ and the ability to compute dot-products, this projection can be computed efficiently by solving
a set of linear equalities.
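One way to read the last remark: the coefficients $a$ of the projection satisfy $G a = b$, where $G_{ij} = v_i \cdot v_j$ and $b_i = v_i \cdot v$, so only dot-products are needed. The sketch below solves this system with NumPy; the helper name `projection_coefficients` is hypothetical, and least squares is our choice (not something the text specifies) so that linearly dependent $v_i$ are tolerated.

```python
import numpy as np

def projection_coefficients(gram, dots_with_v):
    """Solve G a = b, where G[i, j] = v_i . v_j and b[i] = v_i . v.

    The projection of v onto span(v_1, ..., v_k) is then sum_i a[i] * v_i.
    Least squares is used so that linearly dependent v_i are handled too.
    """
    a, *_ = np.linalg.lstsq(gram, dots_with_v, rcond=None)
    return a

# Toy check with explicit vectors; only their dot-products are passed in.
V = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])  # rows are v_1, v_2
v = np.array([2.0, 3.0, 4.0])
a = projection_coefficients(V @ V.T, V @ v)
proj = a @ V  # reconstructing the projection needs the explicit v_i
```

In the kernel setting the dot-products come from calls to $K$, so the same system can be set up without ever computing $\phi$ explicitly.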


6.3  Two  simple  mappings 


Our goal is a procedure that, given black-box access to a kernel function $K(\cdot,\cdot)$, unlabeled examples from
distribution $D$, and a margin value $\gamma$, produces a (probability distribution over) mappings $F : X \to \mathbb{R}^d$
with the following property: if the target function indeed has margin $\gamma$ in the $\phi$-space, then with high
probability our mapping will approximately preserve linear separability. In this section, we analyze two
methods that both produce a space of dimension $d = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$, where $\epsilon$ is our desired bound on
the error rate of the best separator in the mapped space. The second of these mappings in fact satisfies a
stronger condition that its output will be approximately separable at margin $\gamma/2$ (rather than just
approximately separable). This property will allow us to use this mapping as a first step in a better mapping in
Section 6.4.

The  following  lemma  is  key  to  our  analysis. 

Lemma 6.3.1 Consider any distribution over labeled examples in Euclidean space such that there exists
a vector $w$ with margin $\gamma$. Then if we draw

$$d \ge \frac{8}{\epsilon} \left[ \frac{1}{\gamma^2} + \ln \frac{1}{\delta} \right]$$

examples $z_1, \ldots, z_d$ i.i.d. from this distribution, with probability $\ge 1 - \delta$, there exists a vector $w'$ in
span$(z_1, \ldots, z_d)$ that has error at most $\epsilon$ at margin $\gamma/2$.

² This is equivalent to the notion of margin in Chapter 3 since there we have assumed $\|\phi(x)\| \le 1$.

Before proving Lemma 6.3.1, we remark that a somewhat weaker bound on $d$ can be derived from the
machinery of margin bounds. Margin bounds [44, 191] tell us that using $d = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} \log^2\left(\frac{1}{\gamma\epsilon}\right) + \log \frac{1}{\delta}\right]\right)$
points, with probability $1 - \delta$, any separator with margin $\ge \gamma$ over the observed data has true error $\le \epsilon$.
Thus, the projection of the target function $w$ into the space spanned by the observed data will have true
error $\le \epsilon$ as well. (Projecting $w$ into this space maintains the value of $w \cdot z_i$, while possibly shrinking the
vector $w$, which can only increase the margin over the observed data.) The only technical issue is that we
want as a conclusion for the separator not only to have a low error rate over the distribution, but also to
have a large margin. However, this can be obtained from the double-sample argument used in [44, 191] by
using a $\gamma/4$-cover instead of a $\gamma/2$-cover. Margin bounds, however, are a bit of an overkill for our needs,
since we are only asking for an existential statement (the existence of $w'$) and not a universal statement
about all separators with large empirical margins. For this reason we are able to get a better bound by a
direct argument from first principles.

Proof of Lemma 6.3.1: For any set of points $S$, let $w_{in}(S)$ be the projection of $w$ to span$(S)$, and let
$w_{out}(S)$ be the orthogonal portion of $w$, so that $w = w_{in}(S) + w_{out}(S)$ and $w_{in}(S) \perp w_{out}(S)$. Also, for
convenience, assume $w$ and all examples $z$ are unit-length vectors (since we have defined margins in terms
of angles, we can do this without loss of generality). Now, let us make the following definitions. Say that
$w_{out}(S)$ is large if $\Pr_z(|w_{out}(S) \cdot z| > \gamma/2) \ge \epsilon$, and otherwise say that $w_{out}(S)$ is small. Notice that if
$w_{out}(S)$ is small, we are done, because

$$w \cdot z = (w_{in}(S) \cdot z) + (w_{out}(S) \cdot z),$$

which means that $w_{in}(S)$ has the properties we want. That is, there is at most an $\epsilon$ probability mass of
points $z$ whose dot-product with $w$ and $w_{in}(S)$ differ by more than $\gamma/2$. So, we need only to consider
what happens when $w_{out}(S)$ is large.

The crux of the proof now is that if $w_{out}(S)$ is large, this means that a new random point $z$ has at least
an $\epsilon$ chance of significantly improving the set $S$. Specifically, consider $z$ such that $|w_{out}(S) \cdot z| > \gamma/2$.
Let $z_{in}(S)$ be the projection of $z$ to span$(S)$, let $z_{out}(S) = z - z_{in}(S)$ be the portion of $z$ orthogonal to
span$(S)$, and let $z' = z_{out}(S) / \|z_{out}(S)\|$. Now, for $S' = S \cup \{z\}$, we have

$$w_{out}(S') = w_{out}(S) - \text{proj}(w_{out}(S), \text{span}(S')) = w_{out}(S) - (w_{out}(S) \cdot z') z',$$

where the last equality holds because $w_{out}(S)$ is orthogonal to span$(S)$ and so its projection onto span$(S')$
is the same as its projection onto $z'$. Finally, since $w_{out}(S')$ is orthogonal to $z'$ we have

$$\|w_{out}(S')\|^2 = \|w_{out}(S)\|^2 - |w_{out}(S) \cdot z'|^2,$$

and since

$$|w_{out}(S) \cdot z'| \ge |w_{out}(S) \cdot z_{out}(S)| = |w_{out}(S) \cdot z|,$$

this implies by definition of $z$ that

$$\|w_{out}(S')\|^2 < \|w_{out}(S)\|^2 - (\gamma/2)^2.$$

So, we have a situation where so long as $w_{out}$ is large, each example has at least an $\epsilon$ chance of
reducing $\|w_{out}\|^2$ by at least $\gamma^2/4$, and since $\|w\|^2 = \|w_{out}(\emptyset)\|^2 = 1$, this can happen at most $4/\gamma^2$
times. Chernoff bounds state that a coin of bias $\epsilon$ flipped $n = \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$ times will with probability
$1 - \delta$ have at least $n\epsilon/2 \ge 4/\gamma^2$ heads. Together, these imply that with probability at least $1 - \delta$, $w_{out}(S)$
will be small for $|S| \ge \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$ as desired. ■
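The Chernoff step can be spelled out as follows. This is a sketch under an assumption: the text does not say which form of the bound is used, so we take the standard multiplicative lower-tail bound $\Pr[X \le (1-\beta)\mu] \le e^{-\beta^2 \mu / 2}$. Here $X$ counts heads among $n$ flips of a coin with bias $\epsilon$ (so $\mu = n\epsilon$), and we set $\beta = 1/2$ and $n = \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$:

```latex
\begin{align*}
\Pr\left[ X \le \frac{n\epsilon}{2} \right]
  &\le \exp\left( -\frac{n\epsilon}{8} \right)
   = \exp\left( -\frac{1}{\gamma^2} - \ln\frac{1}{\delta} \right)
   = \delta \, e^{-1/\gamma^2}
   \le \delta, \\
\frac{n\epsilon}{2}
  &= 4\left[ \frac{1}{\gamma^2} + \ln\frac{1}{\delta} \right]
   \ge \frac{4}{\gamma^2}.
\end{align*}
```

So with probability at least $1 - \delta$ there are at least $4/\gamma^2$ heads, which is the number of improving examples after which $w_{out}$ can no longer be large.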

Lemma 6.3.1 implies that if $P$ is linearly separable with margin $\gamma$ under $K$, and we draw
$d = \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$ random unlabeled examples $x_1, \ldots, x_d$ from $D$, then with probability at least $1 - \delta$ there is a separator
$w'$ in the $\phi$-space with error rate at most $\epsilon$ that can be written as

$$w' = \alpha_1 \phi(x_1) + \ldots + \alpha_d \phi(x_d).$$

Notice that since $w' \cdot \phi(x) = \alpha_1 K(x, x_1) + \ldots + \alpha_d K(x, x_d)$, an immediate implication is that if we simply
think of $K(x, x_i)$ as the $i$th “feature” of $x$ (that is, if we define $F_1(x) = (K(x, x_1), \ldots, K(x, x_d))$),
then with high probability the vector $(\alpha_1, \ldots, \alpha_d)$ is an approximate linear separator of $F_1(P)$. So, the
kernel and distribution together give us a particularly simple way of performing feature generation that
preserves (approximate) separability. Formally, we have the following.

Corollary 6.3.2 If $P$ has margin $\gamma$ in the $\phi$-space, then with probability $\ge 1 - \delta$, if $x_1, \ldots, x_d$ are drawn
from $D$ for $d = \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$, the mapping

$$F_1(x) = (K(x, x_1), \ldots, K(x, x_d))$$

produces a distribution $F_1(P)$ that is linearly separable with error at most $\epsilon$.
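The mapping $F_1$ of Corollary 6.3.2 is short enough to sketch directly. In the code below, the Gaussian kernel `rbf_kernel` is only a stand-in for the black-box $K$, the Gaussian sample is a stand-in for draws from $D$, and all function names are hypothetical; the parameter values are illustrative and chosen so that the sample is large enough.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Stand-in black-box kernel (an assumption; any kernel K(x, y) would do)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def landmark_count(eps, gamma, delta):
    """d = (8/eps) * (1/gamma^2 + ln(1/delta)), rounded up (Corollary 6.3.2)."""
    return int(np.ceil((8.0 / eps) * (1.0 / gamma ** 2 + np.log(1.0 / delta))))

def make_F1(kernel, landmarks):
    """Return F1(x) = (K(x, x_1), ..., K(x, x_d)) for the given landmarks."""
    landmarks = [np.asarray(p, dtype=float) for p in landmarks]

    def F1(x):
        return np.array([kernel(x, p) for p in landmarks])

    return F1

rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(200, 5))          # stand-in for draws from D
d = landmark_count(eps=0.5, gamma=0.9, delta=0.5)
F1 = make_F1(rbf_kernel, unlabeled[:d])        # first d draws are the landmarks
features = F1(unlabeled[-1])                   # feature vector of a fresh point
```

Any linear-separator learner can then be run on `F1(x)` for labeled `x`; only unlabeled data was used to build the map.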


The above mapping $F_1$ may not preserve margins (within a constant factor) because we do not have
a good bound on the length of the vector $(\alpha_1, \ldots, \alpha_d)$ defining the separator in the new space, or the
length of the examples $F_1(x)$. The key problem is that if many of the $\phi(x_i)$ are very similar, then their
associated features $K(x, x_i)$ will be highly correlated. Instead, to preserve margin we want to choose an
orthonormal basis of the space spanned by the $\phi(x_i)$: i.e., to do an orthogonal projection of $\phi(x)$ into
this space. Specifically, let $S = \{x_1, \ldots, x_d\}$ be a set of $\frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$ unlabeled examples from $D$.
We can then implement the desired orthogonal projection of $\phi(x)$ as follows. Run $K(x, y)$ for all pairs
$x, y \in S$, and let $M(S) = (K(x_i, x_j))_{x_i, x_j \in S}$ be the resulting kernel matrix. Now decompose $M(S)$
into $U^T U$, where $U$ is an upper-triangular matrix. Finally, define the mapping $F_2 : X \to \mathbb{R}^d$ to be
$F_2(x) = F_1(x) U^{-1}$, where $F_1$ is the mapping of Corollary 6.3.2. This is equivalent to an orthogonal
projection of $\phi(x)$ into span$(\phi(x_1), \ldots, \phi(x_d))$. Technically, if $U$ is not full rank then we want to use the
(Moore-Penrose) pseudoinverse [51] of $U$ in place of $U^{-1}$.
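The construction of $F_2$ can be sketched with NumPy's Cholesky factorization. This is a sketch under stated assumptions: `rbf_kernel` again stands in for the black-box $K$, the names are hypothetical, and the code covers only the full-rank case, since `np.linalg.cholesky` raises an error when $M(S)$ is not positive definite (the pseudoinverse case mentioned above is not handled here).

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Stand-in black-box kernel (an assumption, not part of the text)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def make_F2(kernel, landmarks):
    """Return (F2, M) with F2(x) = F1(x) U^{-1} and M = U^T U the kernel matrix."""
    landmarks = [np.asarray(p, dtype=float) for p in landmarks]
    M = np.array([[kernel(p, q) for q in landmarks] for p in landmarks])
    # NumPy returns the lower-triangular L with M = L L^T, so U = L^T.
    U = np.linalg.cholesky(M).T

    def F2(x):
        F1_x = np.array([kernel(x, p) for p in landmarks])
        # The row vector F1(x) U^{-1} equals the solution y of U^T y = F1(x).
        return np.linalg.solve(U.T, F1_x)

    return F2, M

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 3))                 # stand-in for unlabeled draws from D
F2, M = make_F2(rbf_kernel, S)
images = np.array([F2(x) for x in S])       # images of the landmarks themselves
```

On the landmarks, `images @ images.T` reproduces `M`, which is one way to see that $F_2$ acts as an orthogonal projection expressed in an orthonormal basis.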

We now claim that by Lemma 6.3.1, this mapping $F_2$ maintains approximate separability at margin
$\gamma/2$.


Theorem 6.3.3 If $P$ has margin $\gamma$ in the $\phi$-space, then with probability $\ge 1 - \delta$, the mapping $F_2 : X \to \mathbb{R}^d$
for $d \ge \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$ has the property that $F_2(P)$ is linearly separable with error at most $\epsilon$ at margin
$\gamma/2$.

Proof: The theorem follows directly from Lemma 6.3.1 and the fact that $F_2$ is an orthogonal
projection. Specifically, since $\phi(D)$ is separable at margin $\gamma$, Lemma 6.3.1 implies that for $d \ge \frac{8}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]$,
with probability at least $1 - \delta$, there exists a vector $w'$ that can be written as $w' = \alpha_1 \phi(x_1) + \ldots + \alpha_d \phi(x_d)$,
that has error at most $\epsilon$ at margin $\gamma/2$ with respect to $\phi(P)$, i.e.,

$$\Pr_{(x,l) \sim P} \left[ \frac{l \, (w' \cdot \phi(x))}{\|w'\| \, \|\phi(x)\|} < \frac{\gamma}{2} \right] \le \epsilon.$$

Now consider $w = \alpha_1 F_2(x_1) + \ldots + \alpha_d F_2(x_d)$. Since $F_2$ is an orthogonal projection and the $\phi(x_i)$ are
clearly already in the space spanned by the $\phi(x_i)$, $w$ can be viewed as the same as $w'$ but just written in
a different basis. In particular, we have $\|w\| = \|w'\|$, and $w' \cdot \phi(x) = w \cdot F_2(x)$ for all $x \in X$. Since
$\|F_2(x)\| \le \|\phi(x)\|$ for every $x \in X$, we get that $w$ has error at most $\epsilon$ at margin $\gamma/2$ with respect to
$F_2(P)$, i.e.,

$$\Pr_{(x,l) \sim P} \left[ \frac{l \, (w \cdot F_2(x))}{\|w\| \, \|F_2(x)\|} < \frac{\gamma}{2} \right] \le \epsilon.$$

Therefore, for our choice of $d$, with probability at least $1 - \delta$ (over randomization in our choice of $F_2$),
there exists a vector $w \in \mathbb{R}^d$ that has error at most $\epsilon$ at margin $\gamma/2$ with respect to $F_2(P)$. ■


Notice that the running time to compute $F_2(x)$ is polynomial in $1/\gamma$, $1/\epsilon$, $1/\delta$ and the time to compute
the kernel function $K$.


6.4  An  improved  mapping 


We now describe an improved mapping, in which the dimension $d$ has only a logarithmic, rather than
linear, dependence on $1/\epsilon$. The idea is to perform a two-stage process, composing the mapping from
the  previous  section  with  a  random  linear  projection  from  the  range  of  that  mapping  down  to  the  desired 
space.  Thus,  this  mapping  can  be  thought  of  as  combining  two  types  of  random  projection:  a  projection 
based  on  points  chosen  at  random  from  D,  and  a  projection  based  on  choosing  points  uniformly  at  random 
in  the  intermediate  space. 

We begin by stating a result from [7, 21, 96, 136, 142] that we will use. Here $N(0, 1)$ is the standard
Normal distribution with mean 0 and variance 1 and $U(-1, 1)$ is the distribution that has probability 1/2
on $-1$ and probability 1/2 on $1$. Here we present the specific form given in [21].

Theorem 6.4.1 (Neuronal RP [21]) Let $u, v \in \mathbb{R}^n$. Let $u' = \frac{1}{\sqrt{k}} uA$ and $v' = \frac{1}{\sqrt{k}} vA$ where $A$ is an $n \times k$
random matrix whose entries are chosen independently from either $N(0, 1)$ or $U(-1, 1)$. Then,

$$\Pr_A \left[ (1 - \epsilon) \|u - v\|^2 \le \|u' - v'\|^2 \le (1 + \epsilon) \|u - v\|^2 \right] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3) k / 4}.$$


Let $F_2 : X \to \mathbb{R}^{d_2}$ be the mapping from Section 6.3 using $\epsilon/2$ and $\delta/2$ as its error and confidence
parameters respectively. Let $F : \mathbb{R}^{d_2} \to \mathbb{R}^{d_3}$ be a random projection as in Theorem 6.4.1. Specifically,
we pick $A$ to be a random $d_2 \times d_3$ matrix whose entries are chosen i.i.d. $N(0, 1)$ or $U(-1, 1)$. We then
set $F(x) = \frac{1}{\sqrt{d_3}} xA$. We finally consider our overall mapping $F_3 : X \to \mathbb{R}^{d_3}$ to be $F_3(x) = F(F_2(x))$.

We now claim that for $d_2 = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$ and $d_3 = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$, with high probability, this
mapping has the desired properties. The basic argument is that the initial mapping $F_2$ maintains
approximate separability at margin $\gamma/2$ by Lemma 6.3.1, and then the second mapping approximately preserves
this property by Theorem 6.4.1.
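The second stage is an ordinary random projection, which can be sketched as below. The helper name `make_random_projection` and the sizes $d_2 = 2000$, $d_3 = 1000$ are illustrative assumptions, not the values prescribed by the theorem; in the full construction the input to this map would be $F_2(x)$, so that $F_3(x) = F(F_2(x))$.

```python
import numpy as np

def make_random_projection(d_in, d_out, rng, gaussian=True):
    """Return F(x) = x A / sqrt(d_out), with A a random d_in x d_out matrix.

    Entries are i.i.d. N(0, 1) if gaussian is True, else uniform on {-1, +1}.
    """
    if gaussian:
        A = rng.normal(size=(d_in, d_out))
    else:
        A = rng.choice([-1.0, 1.0], size=(d_in, d_out))

    def F(x):
        return np.asarray(x, dtype=float) @ A / np.sqrt(d_out)

    return F

rng = np.random.default_rng(2)
d2, d3 = 2000, 1000                     # illustrative sizes only
F = make_random_projection(d2, d3, rng)
u = rng.normal(size=d2)                 # stand-ins for points in the
v = rng.normal(size=d2)                 # intermediate space R^{d2}
# Ratio of squared distances after and before projection; close to 1.
ratio = np.sum((F(u) - F(v)) ** 2) / np.sum((u - v) ** 2)
```

Note that $F$ is linear and is drawn without looking at the data, in contrast to $F_2$, which depends on the sample from $D$.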


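Theorem 6.4.2 If $P$ has margin $\gamma$ in the $\phi$-space, then with probability at least $1 - \delta$, the mapping
$F_3 = F \circ F_2 : X \to \mathbb{R}^{d_3}$, for values $d_2 = O\left(\frac{1}{\epsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$ and $d_3 = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$, has the
property that $F_3(P)$ is linearly separable with error at most $\epsilon$ at margin $\gamma/4$.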


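Proof:

By Lemma 6.3.1, with probability at least $1 - \delta/2$ there exists a separator $w$ in the intermediate space
$\mathbb{R}^{d_2}$ with error at most $\epsilon/2$ at margin $\gamma/2$. Let us assume this in fact occurs. Now, consider some point
$x \in \mathbb{R}^{d_2}$. Theorem 6.4.1 implies that a choice of $d_3 = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon\delta}\right)$ is sufficient so that under the random
projection $F$, with probability at least $1 - \epsilon\delta/4$, the squared-lengths of $w$, $x$, and $w - x$ are all preserved
up to multiplicative factors of $1 \pm \gamma/16$. This then implies that the cosine of the angle between $w$ and $x$
(i.e., the margin of $x$ with respect to $w$) is preserved up to an additive factor of $\pm\gamma/4$. Specifically, using
$\tilde{x} = \frac{x}{\|x\|}$ and $\tilde{w} = \frac{w}{\|w\|}$, which implies

$$\frac{F(w) \cdot F(x)}{\|F(w)\| \, \|F(x)\|} = \frac{F(\tilde{w}) \cdot F(\tilde{x})}{\|F(\tilde{w})\| \, \|F(\tilde{x})\|},$$

we have:

$$\frac{F(\tilde{w}) \cdot F(\tilde{x})}{\|F(\tilde{w})\| \, \|F(\tilde{x})\|} = \frac{\frac{1}{2}\left( \|F(\tilde{w})\|^2 + \|F(\tilde{x})\|^2 - \|F(\tilde{w}) - F(\tilde{x})\|^2 \right)}{\|F(\tilde{w})\| \, \|F(\tilde{x})\|} \in [\tilde{w} \cdot \tilde{x} - \gamma/4, \; \tilde{w} \cdot \tilde{x} + \gamma/4].$$

In other words, we have shown the following:

$$\text{For all } x, \quad \Pr_A \left[ \left| \frac{w \cdot x}{\|w\| \, \|x\|} - \frac{F(w) \cdot F(x)}{\|F(w)\| \, \|F(x)\|} \right| \ge \gamma/4 \right] \le \epsilon\delta/4.$$

Since the above is true for all $x$, it is clearly true for random $x$ from $F_2(D)$. So,

$$\Pr_{x \sim F_2(D), \, A} \left[ \left| \frac{w \cdot x}{\|w\| \, \|x\|} - \frac{F(w) \cdot F(x)}{\|F(w)\| \, \|F(x)\|} \right| \ge \gamma/4 \right] \le \epsilon\delta/4,$$

which implies that:

$$\Pr_A \left[ \Pr_{x \sim F_2(D)} \left[ \left| \frac{w \cdot x}{\|w\| \, \|x\|} - \frac{F(w) \cdot F(x)}{\|F(w)\| \, \|F(x)\|} \right| \ge \gamma/4 \right] \ge \epsilon/2 \right] \le \delta/2.$$

Since $w$ has error at most $\epsilon/2$ at margin $\gamma/2$, this then implies that the probability that $F(w)$ has error
more than $\epsilon$ over $F(F_2(D))$ at margin $\gamma/4$ is at most $\delta/2$. Combining this with the $\delta/2$ failure probability
of $F_2$ completes the proof.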


As before, the running time to compute our mappings is polynomial in $1/\gamma$, $1/\epsilon$, $1/\delta$ and the time to
compute the kernel function $K$.

Since the dimension of the mapping in Theorem 6.4.2 is only logarithmic in $1/\epsilon$, this means we can
set $\epsilon$ to be small enough so that with high probability, a sample of size $O(d_3 \log d_3)$ would be perfectly
separable. This means we could use any noise-free linear-separator learning algorithm in $\mathbb{R}^{d_3}$ to learn
the target concept. However, this requires using $d_2 = \tilde{O}(1/\gamma^4)$ (i.e., $\tilde{O}(1/\gamma^4)$ unlabeled examples to
construct the mapping).

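Corollary 6.4.3 Given $\epsilon', \delta, \gamma < 1$, if $P$ has margin $\gamma$ in the $\phi$-space, then $\tilde{O}\left(\frac{1}{\epsilon' \gamma^4}\right)$ unlabeled examples
are sufficient so that with probability $1 - \delta$, mapping $F_3 : X \to \mathbb{R}^{d_3}$ has the property that $F_3(P)$ is
linearly separable with error $o(\epsilon' / (d_3 \log d_3))$, where $d_3 = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon' \gamma \delta}\right)$.

Proof: Just plug in the desired error rate into the bounds of Theorem 6.4.2. ■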


6.4.1  A  few  extensions 

So far, we have assumed that the distribution $P$ is perfectly separable with margin $\gamma$ in the $\phi$-space.
Suppose, however, that $P$ is only separable with error $\alpha$ at margin $\gamma$. That is, there exists a vector $w$ in
the $\phi$-space that correctly classifies a $1 - \alpha$ probability mass of examples by margin at least $\gamma$, but the
remaining $\alpha$ probability mass may be either within the margin or incorrectly classified. In that case, we
can apply all the previous results to the $1 - \alpha$ portion of the distribution that is correctly separated by
margin $\gamma$, and the remaining $\alpha$ probability mass of examples may or may not behave as desired. Thus all
preceding results (Lemma 6.3.1, Corollary 6.3.2, Theorem 6.3.3, and Theorem 6.4.2) still hold, but with
$\epsilon$ replaced by $(1 - \alpha)\epsilon + \alpha$ in the error rate of the resulting mapping.

Another extension is to the case that the target separator does not pass through the origin: that is, it is
of the form $w \cdot \phi(x) \ge \beta$ for some value $\beta$. If $\phi$ is normalized, so that $\|\phi(x)\| = 1$ for all $x \in X$, then
all results carry over directly. In particular, all our results follow from arguments showing that the cosine
of the angle between $w$ and $\phi(x)$ changes by at most $\epsilon$ due to the reduction in dimension. If $\phi(x)$ is not
normalized, then all results carry over with $\gamma$ replaced by $\gamma/R$, where $R$ is an upper bound on $\|\phi(x)\|$, as
is done with standard margin bounds [44, 111, 191].


6.5  On  the  necessity  of  access  to  D 

Our algorithms construct mappings $F : X \to \mathbb{R}^d$ using black-box access to the kernel function $K(x, y)$
together with unlabeled examples from the input distribution $D$. It is natural to ask whether it might
be possible to remove the need for access to $D$. In particular, notice that the mapping resulting from
the Johnson-Lindenstrauss lemma has nothing to do with the input distribution: if we have access to the
$\phi$-space, then no matter what the distribution is, a random projection down to $\mathbb{R}^d$ will approximately
preserve the existence of a large-margin separator with high probability.³ So perhaps such a mapping
$F$ can be produced by just computing $K$ on some polynomial number of cleverly-chosen (or uniform
random) points in $X$. (Let us assume $X$ is a “nice” space such as the unit ball or $\{0, 1\}^n$ that can be
randomly sampled.) In this section, we show this is not possible in general for an arbitrary black-box
kernel. This leaves open, however, the case of specific natural kernels.

One way to view the result of this section is as follows. If we define a feature space based on uniform
binary (Rademacher) or gaussian-random points in the $\phi$-space, then we know this will work by the
Johnson-Lindenstrauss lemma. If we define features based on points in $\phi(X)$ (the image of $X$ under $\phi$)
chosen according to $\phi(D)$, then this will work by Corollary 6.3.2. However, if we define features based on
points in $\phi(X)$ chosen according to some method that does not depend on $D$, then there will exist kernels
for which this does not work.

In particular, we demonstrate the necessity of access to $D$ as follows. Consider $X = \{0, 1\}^n$, let $X'$
be a random subset of $2^{n/2}$ elements of $X$, and let $D$ be the uniform distribution on $X'$. For a given target
function $c$, we will define a special $\phi$-function $\phi_c$ such that $c$ is a large margin separator in the $\phi$-space
under distribution $D$, but that only the points in $X'$ behave nicely, and points not in $X'$ provide no useful
information. Specifically, consider $\phi_c : X \to \mathbb{R}^2$ defined as:

$$\phi_c(x) = \begin{cases} (1, 0) & \text{if } x \notin X' \\ (-1/2, \sqrt{3}/2) & \text{if } x \in X' \text{ and } c(x) = 1 \\ (-1/2, -\sqrt{3}/2) & \text{if } x \in X' \text{ and } c(x) = -1 \end{cases}$$

See Figure 6.5.1. This then induces the kernel:

$$K_c(x, y) = \begin{cases} 1 & \text{if } x, y \notin X' \text{ or } [x, y \in X' \text{ and } c(x) = c(y)] \\ -1/2 & \text{otherwise} \end{cases}$$


Notice that the distribution $P = (D, c)$ over labeled examples has margin $\gamma = \sqrt{3}/2$ in the $\phi$-space.
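As a sanity check, the construction can be instantiated on a tiny domain. The sketch below is illustrative only: the helper names (`make_phi`, `kernel`, `expected_kernel`), the choice $n = 4$, and the fixed random seed are assumptions made for the example and are not part of the argument. It checks that the inner products of $\phi_c$ reproduce the two kernel values 1 and $-1/2$, and that the direction $w = (0, 1)$ separates $X'$ with margin $\sqrt{3}/2$.

```python
import itertools
import math
import random


def make_phi(x_prime, c):
    """Build phi_c for a subset x_prime of the domain and labels c (point -> +1 or -1)."""
    s = math.sqrt(3) / 2

    def phi(x):
        if x not in x_prime:
            return (1.0, 0.0)
        return (-0.5, s) if c[x] == 1 else (-0.5, -s)

    return phi


def kernel(phi, x, y):
    """Kernel induced by phi: the inner product in the 2-d phi-space."""
    px, py = phi(x), phi(y)
    return px[0] * py[0] + px[1] * py[1]


n = 4                                   # toy dimension (assumption for the example)
rng = random.Random(0)                  # fixed seed, only for reproducibility
points = list(itertools.product((0, 1), repeat=n))
x_prime = set(rng.sample(points, 2 ** (n // 2)))        # |X'| = 2^(n/2)
c = {x: rng.choice((-1, 1)) for x in sorted(x_prime)}   # random target on X'
phi = make_phi(x_prime, c)


def expected_kernel(x, y):
    """Closed form of the kernel as stated in the text."""
    both_outside = x not in x_prime and y not in x_prime
    both_inside_same_label = x in x_prime and y in x_prime and c[x] == c[y]
    return 1.0 if (both_outside or both_inside_same_label) else -0.5


kernel_matches = all(
    math.isclose(kernel(phi, x, y), expected_kernel(x, y), abs_tol=1e-12)
    for x in points
    for y in points
)

# Margin of the separator w = (0, 1) on the support X' of D.
w = (0.0, 1.0)
margins = [c[x] * (w[0] * phi(x)[0] + w[1] * phi(x)[1]) for x in x_prime]
```

Points outside $X'$ all map to $(1, 0)$, so they carry no information about $c$; this is the property the lower bound exploits.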

^To  be  clear  about  the  order  of  quantification,  the  statement  is  that  for  any  distribution,  a  random  projection  will  work  with 
high  probability.  However,  for  any  given  projection,  there  may  exist  bad  distributions.  So,  even  if  we  could  define  a  mapping  of 
the  sort  desired,  we  might  still  expect  the  algorithm  to  be  randomized. 




Figure 6.5.1: Function $\phi_c$ used in lower bound.


Theorem 6.5.1 Suppose an algorithm makes polynomially many calls to a black-box kernel function over
input space $\{0, 1\}^n$ and produces a mapping $F : X \to \mathbb{R}^d$ where $d$ is polynomial in $n$. Then for random
$X'$ and random $c$ in the above construction, with high probability $F(P)$ will not even be weakly-separable
(even though $P$ has margin $\gamma = \sqrt{3}/2$ in the $\phi$-space).

Proof: Consider any algorithm with black-box access to $K$ attempting to create a mapping $F : X \to \mathbb{R}^d$.
Since $X'$ is a random exponentially-small fraction of $X$, with high probability all calls made to $K$ when
constructing the function $F$ are on inputs not in $X'$. Let us assume this indeed is the case. This implies
that (a) all calls made to $K$ when constructing the function $F$ return the value 1, and (b) at “runtime” when
$x$ is chosen from $D$ (i.e., when $F$ is used to map training data), even though the function $F(x)$ may itself
call $K(x, y)$ for different previously-seen points $y$, these will all give $K(x, y) = -1/2$. In particular, this
means that $F(x)$ is independent of the target function $c$. Finally, since $X'$ has size $2^{n/2}$ and $d$ is only
polynomial in $n$, we have by simply counting the number of possible partitions of $F(X')$ by halfspaces
that with high probability $F(P)$ will not even be weakly separable for a random function $c$ over $X'$.
Specifically, for any given halfspace, the probability over choice of $c$ that it has error less than $1/2 - \epsilon$
is exponentially small in $|X'|$ (by Hoeffding bounds), which is doubly-exponentially small in $n$, whereas
there are “only” $2^{O(dn)}$ possible partitions by halfspaces. ■
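The final counting step can be written out as a union bound. The following is a sketch under stated assumptions rather than part of the original proof: the labels $c(x)$, $x \in X'$, are taken to be independent and uniform on $\{-1, +1\}$, $m$ denotes $|X'|$, $\mathrm{err}(h)$ is the error of halfspace $h$ on $F(P)$, and the number of partitions is bounded via Sauer's lemma for halfspaces in $\mathbb{R}^d$ (VC-dimension $d + 1$, with $m \ge d + 1$).

```latex
\begin{align*}
\Pr_c\!\left[\mathrm{err}(h) \le \tfrac{1}{2} - \epsilon\right]
  &\le e^{-2\epsilon^2 m}, \qquad m = |X'| = 2^{n/2}
  && \text{(Hoeffding, for one fixed halfspace } h\text{)} \\
\#\{\text{partitions of } F(X') \text{ by halfspaces}\}
  &\le \left(\frac{e\,m}{d+1}\right)^{d+1} \le 2^{O(dn)}
  && \text{(Sauer's lemma)} \\
\Pr_c\!\left[\exists\, h : \mathrm{err}(h) \le \tfrac{1}{2} - \epsilon\right]
  &\le 2^{O(dn)} \cdot e^{-2\epsilon^2 2^{n/2}}
  && \text{(union bound over partitions)}
\end{align*}
```

For $d$ polynomial in $n$ and $\epsilon$ at least inverse-polynomial in $n$, the exponent $-2\epsilon^2 2^{n/2}$ dominates $O(dn)$, so the last bound tends to 0 as $n$ grows.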

Notice that the kernel in the above argument is positive semidefinite. If we wish to have a positive
definite kernel, we can simply change “1” to “$1 - \alpha$” and “$-1/2$” to “$-\frac{1}{2}(1 - \alpha)$” in the definition of
$K(x, y)$, except for $y = x$ in which case we keep $K(x, y) = 1$. This corresponds to a function $\phi$ in which
rather than mapping points exactly into $\mathbb{R}^2$, we map into $\mathbb{R}^{2+2^n}$, giving each example a $\sqrt{\alpha}$-component in
its own dimension, and we scale the first two components by $\sqrt{1 - \alpha}$ to keep $\phi_c(x)$ a unit vector. The
margin now becomes $\frac{\sqrt{3}}{2}(1 - \alpha)$. Since the modifications provide no real change (an algorithm with access
to the original kernel can simulate this one), the above arguments apply to this kernel as well.
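The modified feature map can also be checked numerically on a toy domain. The sketch below is an illustration under assumptions: the names `phi_pd` and `modified_kernel`, the values $n = 3$ and $\alpha = 0.1$, and the hand-picked $X'$ and $c$ are all hypothetical. It verifies that giving every point a private $\sqrt{\alpha}$ coordinate and scaling the two shared coordinates by $\sqrt{1 - \alpha}$ yields unit vectors whose inner products match the modified kernel.

```python
import itertools
import math


def phi_c(x, x_prime, c):
    """Original 2-d map from the lower-bound construction."""
    s = math.sqrt(3) / 2
    if x not in x_prime:
        return (1.0, 0.0)
    return (-0.5, s) if c[x] == 1 else (-0.5, -s)


def phi_pd(x, index, x_prime, c, alpha):
    """Scaled 2-d part followed by a private sqrt(alpha) coordinate for x."""
    base = phi_c(x, x_prime, c)
    scale = math.sqrt(1 - alpha)
    private = [0.0] * len(index)
    private[index[x]] = math.sqrt(alpha)
    return [scale * base[0], scale * base[1]] + private


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def modified_kernel(x, y, x_prime, c, alpha):
    """Kernel after replacing 1 by (1 - alpha) and -1/2 by -(1 - alpha)/2, keeping K(x, x) = 1."""
    if x == y:
        return 1.0
    same_side = (x not in x_prime and y not in x_prime) or (
        x in x_prime and y in x_prime and c[x] == c[y]
    )
    return (1 - alpha) if same_side else -0.5 * (1 - alpha)


n, alpha = 3, 0.1                                   # toy values (assumptions)
points = list(itertools.product((0, 1), repeat=n))
index = {x: i for i, x in enumerate(points)}
x_prime = {points[1], points[6]}                    # hand-picked special subset
c = {points[1]: 1, points[6]: -1}

features = {x: phi_pd(x, index, x_prime, c, alpha) for x in points}
gram_matches = all(
    math.isclose(
        dot(features[x], features[y]),
        modified_kernel(x, y, x_prime, c, alpha),
        abs_tol=1e-12,
    )
    for x in points
    for y in points
)
unit_norm = all(math.isclose(dot(f, f), 1.0) for f in features.values())
```

In matrix terms the new Gram matrix is $(1 - \alpha)$ times the original one plus $\alpha$ times the identity, so every eigenvalue is at least $\alpha > 0$.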

One might complain that the kernels used in the above argument are not efficiently computable. However,
this can be rectified (assuming the existence of one-way functions) by defining $X'$ to be a cryptographically
pseudorandom subset of $X$ and $c$ to be a pseudorandom function [125]. In this case, except
for the very last step, the above argument still holds for polynomial-time algorithms. The only issue,
which arises in the last step, is that we do not know any polynomial-time algorithm to test if $F(P)$ is
weakly-separable in $\mathbb{R}^d$ (which would distinguish $c$ from a truly-random function and provide the needed
contradiction). Thus, we would need to change the conclusion of the theorem to be that “$F(P)$ is not even
weakly-learnable by a polynomial time algorithm”.

Of  course,  these  kernels  are  extremely  unnatural,  each  with  its  own  hidden  target  function  built  in. 
It  seems  quite  conceivable  that  positive  results  independent  of  the  distribution  D  can  be  achieved  for 
standard,  natural  kernels. 

6.6  Conclusions  and  Discussion 

We show how, given black-box access to a kernel function $K$ and a distribution $D$ (i.e., unlabeled examples),
we can use $K$ and $D$ together to efficiently construct a new low-dimensional feature space in which
to place the data that approximately preserves the desired properties of the kernel. Our procedure uses
two types of “random” mappings. The first is a mapping based on random examples drawn from $D$ that
is used to construct the intermediate space, and the second is a mapping based on Rademacher/binary (or
Gaussian) random vectors in the intermediate space as in the Johnson-Lindenstrauss lemma.

Our  analysis  suggests  that  designing  a  good  kernel  function  is  much  like  designing  a  good  feature 
space.  It  also  provides  an  alternative  to  “kernelizing”  a  learning  algorithm:  rather  than  modifying  the 
algorithm  to  use  kernels,  one  can  instead  construct  a  mapping  into  a  low-dimensional  space  using  the 
kernel  and  the  data  distribution,  and  then  run  an  un-kernelized  algorithm  over  examples  drawn  from  the 
mapped  distribution. 

Our main concrete open question is whether, for natural standard kernel functions, one can produce
mappings $F : X \to \mathbb{R}^d$ in an oblivious manner, without using examples from the data distribution. The
Johnson-Lindenstrauss lemma tells us that such mappings exist, but the goal is to produce them without
explicitly computing the $\phi$-function. Barring that, perhaps one can at least reduce the unlabeled
sample-complexity of our approach.

On the practical side, it would be interesting to explore the alternatives that these (or other) mappings
provide to widely used algorithms such as SVM or Kernel Perceptron.




Chapter  7 

Mechanism  Design,  Machine  Learning, 
and  Pricing  Problems 


In  this  chapter  we  make  an  explicit  connection  between  machine  learning  and  mechanism  design.  In 
particular,  we  show  how  Sample  Complexity  techniques  in  Statistical  Learning  Theory  can  be  used  to 
reduce  problems  of  incentive-compatible  mechanism  design  to  standard  algorithmic  questions,  for  a  wide 
range  of  revenue-maximizing  problems  in  an  unlimited  (or  unrestricted)  supply  setting. 

7.1  Introduction,  Problem  Formulation 

In  recent  years  there  has  been  substantial  work  on  problems  of  algorithmic  mechanism  design.  These 
problems  typically  take  a  form  similar  to  classic  algorithm  design  or  approximation-algorithm  questions, 
except  that  the  inputs  are  each  given  by  selfish  agents  who  have  their  own  interest  in  the  outcome  of  the 
computation.  As  a  result  it  is  desirable  that  the  mechanisms  (the  algorithms  and  protocol)  be  incentive 
compatible  —  meaning  that  it  is  in  each  agent’s  best  interest  to  report  its  true  value  —  so  that  agents  do 
not  try  to  game  the  system.  This  requirement  can  greatly  complicate  the  design  problem. 

In this work we consider the design of mechanisms for one of the most fundamental economic objectives:
profit maximization. Agents participating in such a mechanism may choose to falsely report their
preferences  if  it  might  benefit  them.  What  we  show,  however,  is  that  so  long  as  the  number  of  agents  is 
sufficiently  large  as  a  function  of  a  measure  of  the  complexity  of  the  mechanism  design  problem,  we  can 
apply  sample-complexity  techniques  from  learning  theory  to  reduce  this  problem  to  standard  algorithmic 
questions  in  a  broad  class  of  settings.  It  is  useful  to  think  of  the  techniques  we  develop  in  the  context  of 
designing  an  auction  to  sell  some  goods  or  services,  though  they  also  apply  in  more  general  scenarios. 

In a seminal paper Myerson [171] derives the optimal auction for selling a single item given that
the bidders’ true valuations for the item come from some known prior distribution. Following a trend
in the recent computer science literature on optimal auction design, we consider the prior-free setting in
which there is no underlying distribution on valuations and we wish to perform well for any (sufficiently
large) set of bidders. In the absence of a known prior distribution we will use machine learning techniques
to estimate properties of the bidders’ valuations. We consider the unlimited supply setting, in which this
problem is conceptually simpler because there are no infeasible allocations, though it is often possible
to obtain results for limited supply or with cost functions on the outcome via reduction to the unlimited
supply case [9, 106, 124]. Research in optimal prior-free auction design is important for optimal auction
design because it directly links inaccurate distributional knowledge typical of small markets with loss in
performance.




Implicit  in  mechanism  design  problems  is  the  fact  that  the  selfish  agents  that  will  be  participating  in  the 
mechanism  have  private  information  that  is  known  only  to  them.  Often  this  private  information  is  simply 
the  agent’s  valuation  over  the  possible  outcomes  the  mechanism  could  produce.  For  example,  when  selling 
a  single  item  (with  the  standard  assumption  that  an  agent  only  cares  if  they  get  the  item  or  not  and  not 
whether  another  agent  gets  it)  this  valuation  is  simply  how  much  they  are  willing  to  pay  for  the  item.  There 
may  also  be  public  information  associated  with  each  agent.  This  information  is  assumed  to  be  available 
to  the  mechanism.  Such  information  is  present  in  structured  optimization  problems  such  as  the  knapsack 
auction problem [9] and multicast auction problem [106] and is the natural way to generalize optimal
auction design for independent but non-identically distributed prior distributions (which are considered by
Myerson [171]) to the prior-free setting. There are many standard economic settings where such public
information  is  available,  e.g.,  in  the  college  tuition  mechanism,  in-state  or  out-of-state  residential  status  is 
public;  for  acquiring  a  loan,  a  consumer’s  credit  report  is  public  information;  for  automobile  insurance, 
driving  records,  credit  reports,  and  the  make  and  color  of  the  vehicle  are  public  information. 

A  fundamental  building  block  of  an  incentive  compatible  mechanism  is  an  offer.  For  full  generality  an 
offer  can  be  viewed  as  an  incentive  compatible  mechanism  for  one  agent.  As  an  example,  if  we  are  selling 
multiple  units  of  a  single  item,  an  offer  could  be  a  take-it-or-leave-it  price  per  unit.  A  rational  agent  would 
accept  such  an  offer  if  it  is  lower  than  the  agent’s  valuation  for  the  item  and  reject  if  it  is  greater.  Notice 
that  if  all  agents  are  given  the  same  take-it-or-leave-it  price  then  the  outcome  is  non-discriminatory  and 
the  same  price  is  paid  by  all  winners.  Prior-free  auctions  based  on  this  type  of  non-discriminatory  pricing 
have been considered previously (see, e.g., [124]).

One of the main motivations of this work is to explore discriminatory pricing in optimal auction
design. There are two standard means to achieve discriminatory pricing. The first is to discriminate based
on the public information of the consumer. Naturally, loans are more costly for individuals with poor
credit scores, car insurance is more expensive for drivers with points on their driving record, and college
tuition at state-run universities is cheaper for students that are in-state residents. In this setting a reasonable
offer might be a mapping from the public information of the agents to a take-it-or-leave-it price. We refer
to these types of offers as pricing functions. The second standard means for discriminatory pricing is to
introduce similar products of different qualities and price them differently. Consumers who cannot afford
the expensive high-quality version may still purchase an inexpensive low-quality version. This practice
is common, for example, in software sales, electronics sales, and airline ticket sales. An offer for the
multiple good setting could be a take-it-or-leave-it price for each good. An agent would then be free to
select the good (or bundle of goods) with the (total) price that they most prefer. We refer to these types of
offers as item pricings.
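The two kinds of offers can be made concrete with a small sketch. Everything named here is an assumption for illustration: the function names, the tuition and product prices, the rule that an agent accepts when its valuation equals the price, and the quasi-linear utility (value minus total price) used to decide which bundle an agent "most prefers".

```python
from itertools import combinations


def pricing_function_payment(price_of, public_attribute, valuation):
    """Pricing function: the take-it-or-leave-it price depends only on public information."""
    price = price_of[public_attribute]
    return price if valuation >= price else 0


def item_pricing_payment(prices, valuation):
    """Item pricing: the agent buys the bundle maximizing value minus total price.

    prices: dict item -> price; valuation: function from a tuple of items to a value.
    The empty bundle (utility 0, payment 0) is the default choice.
    """
    items = sorted(prices)
    best_bundle, best_utility = (), 0
    for size in range(1, len(items) + 1):
        for bundle in combinations(items, size):
            utility = valuation(bundle) - sum(prices[i] for i in bundle)
            if utility > best_utility:
                best_bundle, best_utility = bundle, utility
    return sum(prices[i] for i in best_bundle)


# Pricing function: price determined by residency status.
tuition = {"in_state": 10, "out_of_state": 25}
payments = [
    pricing_function_payment(tuition, "in_state", 12),      # accepts the price 10
    pricing_function_payment(tuition, "out_of_state", 12),  # declines the price 25
    pricing_function_payment(tuition, "out_of_state", 30),  # accepts the price 25
]


def unit_demand(values):
    """Valuation of an agent who only cares about the best single item in a bundle."""
    return lambda bundle: max(values[i] for i in bundle)


# Item pricing: a low-quality and a high-quality version of a product.
prices = {"basic": 5, "pro": 12}
budget_buyer = unit_demand({"basic": 8, "pro": 13})
power_user = unit_demand({"basic": 8, "pro": 20})
item_payments = [
    item_pricing_payment(prices, budget_buyer),
    item_pricing_payment(prices, power_user),
]
```

In both cases the agent's payment is determined by the offer and the agent's own preference, which is all the abstract framework later needs.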

Notice that allowing offers in the form of pricing functions and item pricings, as described above,
provides richness to both algorithmic and mechanism design questions. This richness, however, is not
without cost. Our performance bounds are parameterized by a suitable notion of the complexity of the
class of allowable offers. It is natural that this kind of complexity should affect the ability of a mechanism
to optimize. It is easier to approximate the optimal offer from a simple class of offers, such as
take-it-or-leave-it prices for a single item, than it is for a more complex class of offers, such as
take-it-or-leave-it prices for multiple items. Our prior-free analysis makes the relationship between a mechanism’s
performance and the complexity of allowed offers precise.

We phrase our auction problem generically as: given some class of reasonable offers, can we construct
an incentive-compatible auction that obtains profit close to the profit obtained by the optimal offer from
this class? The auctions we discuss are generalizations of the random sampling auction of Goldberg et
al. [121]. These auctions make use of a (non-incentive-compatible) algorithm for computing a best (or
approximately best) offer from a given class for any set of consumers. Thus, we can view this construction
as reducing the optimal mechanism design problem to the optimal algorithm design problem.

The idea of the reduction is as follows. Let $A$ be an algorithm (exact or approximate) for the purely
algorithmic problem of finding the optimal offer in some class $\mathcal{G}$ for any given set of consumers $S$ with
known valuations. Our auction, which does not know the valuations a priori, asks the agents to report
their valuations (as bids), splits agents randomly into two sets $S_1$ and $S_2$, runs the algorithm $A$ separately
on each set (perhaps adding an additional penalty term to the objective to penalize solutions that are too
“complex” according to some measure), and then applies the offer found for $S_1$ to $S_2$ and the offer found
on $S_2$ to $S_1$. The incentive compatibility of this auction allows us to assume that the agents will indeed
report their true valuations. Sample-complexity techniques adapted from machine learning theory can
then give a guarantee on the quality of the results if the market size is sufficiently large compared to a
measure of complexity of the class of possible solutions. From an economics perspective, this can be
viewed as replacing the Bayesian assumption that bidders come from a known prior distribution (e.g., as
in Myerson’s work [171]) with the use of learning, over a random subset $S_1$ of an arbitrary set of bidders
$S$, to get enough information to apply to $S_2$ (and vice versa).
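The reduction described above can be sketched for the simplest class of offers, a single take-it-or-leave-it price. This is a minimal illustration under assumptions, not the mechanism as analyzed later: the function names are hypothetical, the penalty term is omitted, an agent is assumed to accept when its bid is at least the price, and an empty half is given a price that nobody accepts.

```python
import random


def payment(price, bid):
    """Payment of one agent facing a take-it-or-leave-it price."""
    return price if bid >= price else 0


def best_price(bids):
    """Algorithm A for single prices: some bid value is always an optimal price."""
    if not bids:
        return float("inf")  # nothing learned from an empty set: offer an unacceptable price
    candidates = sorted(set(bids))
    return max(candidates, key=lambda p: sum(payment(p, b) for b in bids))


def random_split(num_agents, rng):
    """Assign each agent independently and uniformly to S1 or S2."""
    s1, s2 = [], []
    for i in range(num_agents):
        (s1 if rng.random() < 0.5 else s2).append(i)
    return s1, s2


def sampling_auction(bids, s1, s2, algorithm=best_price):
    """Run A on each half and apply each half's offer to the other half."""
    offer_from_s1 = algorithm([bids[i] for i in s1])
    offer_from_s2 = algorithm([bids[i] for i in s2])
    payments = [0] * len(bids)
    for i in s2:
        payments[i] = payment(offer_from_s1, bids[i])
    for i in s1:
        payments[i] = payment(offer_from_s2, bids[i])
    return payments


bids = [10, 10, 1, 1, 1, 1]
s1, s2 = [0, 2, 3], [1, 4, 5]   # a fixed split, so the example is reproducible
payments = sampling_auction(bids, s1, s2)

# Agent 0 overbids: the price it faces is computed from S2 only, so it is unchanged.
overbid_payments = sampling_auction([50, 10, 1, 1, 1, 1], s1, s2)

# In the auction itself the split is random:
random_s1, random_s2 = random_split(len(bids), random.Random(0))
```

The price offered to an agent is computed only from bids in the opposite half, so an agent cannot change the price it faces by misreporting; in the overbidding run the only effect is on the price offered to the other half.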

It  is  easy  to  see  that  as  the  size  of  the  market  grows,  the  law  of  large  numbers  indicates  that  the  above 
approach  is  asymptotically  optimal.  This  is  not  surprising  as  conventional  economic  wisdom  suggests  that 
even  the  approach  of  market  analysis  followed  by  the  Bayesian  optimal  mechanism  would  incur  negligibly 
small  loss  compared  to  the  Bayesian  optimal  mechanism  which  was  endowed  with  foreknowledge  of  the 
distribution.  In  contrast,  the  main  contribution  of  this  work  is  to  give  a  mechanism  with  upper  bounds  on 
the  convergence  rate,  i.e.,  the  relationship  between  the  size  of  the  market,  the  approximation  factor,  and 
the  complexity  of  the  class  of  reasonable  offers. 

Our contributions: We present a general framework for reducing problems of incentive-compatible
mechanism design to standard algorithmic questions, for a broad class of revenue-maximizing pricing
problems. To obtain our bounds we use and extend sample-complexity techniques from machine learning
theory (see [18, 69, 149, 203]), and to design our mechanisms we employ machine learning methods
such as structural risk minimization. In general we show that an algorithm (or $\beta$-approximation) can be
converted into a $(1 + \epsilon)$-approximation (or $\beta(1 + \epsilon)$-approximation) for the optimal mechanism design
problem when the market size is at least $O(\beta \epsilon^{-2})$ times a reasonable notion of the complexity of the class
of offers considered. Our formulas relating the size of the market to the approximation factor give upper
bounds on the performance loss due to unknown market conditions and we view these as bounds on the
convergence rate of our mechanism. From a learning perspective, the mechanism-design setting presents
a number of technical challenges when attempting to get good bounds: in particular, the payoff function
is discontinuous and asymmetric, and the payoffs for different offers are non-uniform. For example, in
Section 7.3.3 we develop bounds based on a different notion of covering number than typically used in
machine learning, in order to obtain results that are more meaningful for our setting.

We instantiate our framework for a variety of problems, some of which have been previously considered
in the literature, including:

Digital Good Auction Problem: The digital good auction problem considers the sale of an unlimited
number of units of an item to indistinguishable consumers, and has been considered by Goldberg et
al. [121] and a number of subsequent papers. As argued in [121], the only reasonable offers for this
setting are take-it-or-leave-it prices.

The analysis techniques developed in our work give a simple proof that the random sampling auction
(related to that of [121]) obtains a $(1 - \epsilon)$ fraction of the optimal offer as long as the market size is
at least $O\left(\frac{h}{\epsilon^2} \log \frac{1}{\epsilon}\right)$ (where $h$ is an upper bound on the valuation of any agent).

Attribute Auction Problem: The attribute auction problem is an abstraction of the problem of using
discriminatory prices based on public information (a.k.a. attributes) of the agents. A seller can often
increase its profit by using discriminatory pricing: for example, the motion picture industry uses
region encodings so that they can charge different prices for DVDs sold in different markets. Further,
in many generalizations of the digital good auction problem, the agents are distinguishable via
public information so the techniques exposed in the study of attribute auctions are fundamental to
the study of profit maximization in general settings.

Here a reasonable class of offers to consider are mappings from the agents’ attributes to
take-it-or-leave-it prices. As such, we refer to these offers as pricing functions. For example, for
one-dimensional attributes, a natural class of pricing functions might be piece-wise constant functions
with $k$ prices, as studied in [60]. In our work we give a general treatment that can be applied
to  arbitrary  classes  of  pricing  functions.  For  example,  if  attributes  are  multi-dimensional,  pricing 
functions might involve partitioning agents into markets defined by coordinate values or by some
natural clustering, and then offering a constant price or a price that is some other simple function of
the attributes within each market. Our bounds give a $(1 + \epsilon)$-approximation when the market size
is large in comparison to $h/\epsilon^2$ scaled by a suitable notion of the complexity of the class of offers.

Combinatorial Auction Problem: We also consider the goal of profit maximization in an
unlimited-supply combinatorial auction. This generalizes the digital good auction and exemplifies the problem
of  discriminatory  pricing  through  the  sale  of  multiple  products.  The  setting  here  is  the  following.  We 
have  m  different  items,  each  in  unlimited  supply  (like  a  supermarket),  and  bidders  have  valuations 
over  subsets  of  items.  Our  goal  is  to  achieve  revenue  nearly  as  large  as  the  best  revenue  that  uses 
take-it-or-leave-it  prices  for  each  item  individually,  i.e.,  the  best  item-pricing. 

For arbitrary item pricings we show that our reduction has a convergence rate of $\tilde{O}(hm^2/\epsilon^2)$ no
matter how complicated the bidders’ valuations are (where the $\tilde{O}$ hides terms logarithmic in $n$, the
number of agents; $m$, the number of items; and $h$, the highest valuation). If instead the specification
of the problem constrains the item prices to be integral (e.g., in pennies) or the consumers to be
unit-demand (desiring only one of several items) or single-minded (desiring only a particular bundle of
items) then our bound improves to $\tilde{O}(hm/\epsilon^2)$. This improves on the bounds given by [119] for the
unit-demand case by roughly a factor of $m$.

A  special  case  of  this  setting  is  the  problem  of  auctioning  the  right  to  traverse  paths  in  a  network. 
When  the  network  is  a  tree  and  each  user  wants  to  reach  the  root  (like  drivers  commuting  into 
a city or a multicast tree in the Internet), Guruswami et al. [129] give an exact algorithm for the
algorithmic  problem  to  which  our  reduction  applies  as  noted  above. 

Related Work: Several papers [60, 65] have applied machine learning techniques to mechanism design in
the context of maximizing revenue in online auctions. The online setting is more difficult than the “batch”
setting we consider, but the flip-side is that as a result, that work only applies to quite simple mechanism
design settings where the class $\mathcal{G}$ of allowable offers has small size and can be easily listed. Also, in
a similar spirit to the goals of our work, Awerbuch et al. [22] give reductions from online mechanism
design to online optimization for a broad class of revenue maximization problems. Their work compares
performance to the sum of bidders’ valuations, a quite demanding measure. As a result, however, their
approximation factors are necessarily logarithmic rather than $(1 + \epsilon)$ as in our results.

Structure of this chapter: The structure of the chapter is as follows. We describe the general setting in
which our results apply in Section 7.2 and give our generic reduction and bounds in Section 7.3. We then
apply our techniques to the digital good auction problem (Section 7.4), attribute auction problems
(Section 7.5), and the problem of item-pricing in combinatorial auctions (Section 7.6). We present our conclusions
in Section 7.7.


7.2  Model,  Notation,  and  Definitions 

7.2.1  Abstract  Model 

We assume a set $S = \{1, \ldots, n\}$ of agents. At the heart of our approach to mechanism design is the
idea that the interaction between a mechanism and an agent results from the combination of an agent’s
preference with an offer made by the mechanism. The precise notion of what preferences and offers are
will depend on the setting and is defined in Section 7.2.2. However, fixing the preference of agent $i$ and
an offer $g$ we let $g(i)$ represent the payment made to the mechanism when agent $i$’s preference is applied
to the offer $g$. Essentially, we are letting the structure of an agent’s preference and the structure of the
offer be represented solely by $g(i)$. We extend our notation to allow $g(S)$ to be the total profit when
offering $g$ to all agents in $S$, and we assume that $g(S) = \sum_{i \in S} g(i)$, which effectively corresponds to an
unlimited-supply assumption in the auction setting.

In our setting we have a class $\mathcal{G}$ of allowable offers. Our problem will be to find offers in $\mathcal{G}$ to make to
the  agents  to  maximize  our  profit.  For  this  abstract  setting  we  propose  an  algorithmic  optimization  problem 
and  a  mechanism  design  problem,  the  difference  being  that  in  the  former  we  constrain  the  algorithm  to 
make  the  same  offer  to  all  agents,  and  in  the  latter  the  mechanism  is  constrained  by  lack  of  prior  knowledge 
of  the  agents’  true  preferences  and  must  be  incentive  compatible. 

Given the true preferences of $S$ and a class of offers $\mathcal{G}$, the algorithmic optimization problem is to
find the $g \in \mathcal{G}$ with maximum profit, i.e., $\mathrm{opt}_{\mathcal{G}}(S) = \operatorname{argmax}_{g \in \mathcal{G}} g(S)$. Let $\mathrm{OPT}_{\mathcal{G}}(S) = \max_{g \in \mathcal{G}} g(S)$
be this maximum profit. This computational problem is interesting in its own right, especially when the
structure of agent preferences and the allowable offers results in a concise formula for $g(i)$ for all $g \in \mathcal{G}$
and all $i \in S$. All of the techniques we develop assume that such an algorithm (or an approximation to it)
exists, and some require existence of an algorithm that optimizes over the profit of an offer minus some
penalty term that is related to the complexity of the offer, i.e., $\max_{g \in \mathcal{G}} [g(S) - \mathrm{pen}_g(S)]$.
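For a finite class of offers the algorithmic problem is a direct maximization. The sketch below is illustrative only: the names `price_offer`, `total_profit`, and `opt` are hypothetical, offers are single take-it-or-leave-it prices, each agent is represented by its valuation, and the penalty shown in the second call is an arbitrary made-up function rather than the complexity penalty developed later in the chapter.

```python
def price_offer(price):
    """Offer g for a take-it-or-leave-it price; g(i) is the payment of an agent with valuation v_i."""
    return lambda valuation: price if valuation >= price else 0


def total_profit(offer, valuations):
    """g(S): the sum of g(i) over all agents in S."""
    return sum(offer(v) for v in valuations)


def opt(offers, valuations, penalty=None):
    """Return (label, objective value) of the offer maximizing g(S) minus an optional penalty.

    offers: dict label -> offer; penalty: optional function label -> number.
    """
    def objective(label):
        pen = penalty(label) if penalty else 0
        return total_profit(offers[label], valuations) - pen

    best = max(sorted(offers), key=objective)
    return best, objective(best)


valuations = [3, 5, 8]
offers = {p: price_offer(p) for p in range(1, 9)}

# Unpenalized problem: the argmax offer and the maximum profit.
best_label, best_value = opt(offers, valuations)

# Penalized problem with a made-up penalty that charges 2 for the price 5.
penalized_label, penalized_value = opt(
    offers, valuations, penalty=lambda p: 2 if p == 5 else 0
)
```

Here the price 5 earns 10 from the two agents with valuations 5 and 8, which is the maximum; once it is penalized, the price 3 (profit 9) becomes the maximizer.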

We now define an abstract mechanism-design-like problem that is modelled after the standard
characterization of single-round sealed-bid direct-revelation incentive-compatible mechanisms (see below). For
the class of offers $\mathcal{G}$, each agent has a payoff profile which lists the payment they would make for each
possible offer, i.e., $[g(i)]_{g \in \mathcal{G}}$ for agent $i$ (notice that this represents all of the relevant information in agent
$i$’s preference). Our abstract mechanism chooses an offer $g_i$ for each agent $i$ in a way that is independent
of that agent’s payoff profile, but can be a function of the agent’s identity and the payoff profiles of other
agents. That is, for some function $f$, $g_i = f(i, [g(j)]_{g \in \mathcal{G},\, j \neq i})$. The mechanism then selects the outcome
for agent $i$ determined by their preference and $g_i$, which nets a profit of $g_i(i)$. The total profit of such
a mechanism is $\sum_i g_i(i)$. We define an abstract deterministic mechanism to be completely specified by
such a function $f$ and an abstract randomized mechanism is a randomization over abstract deterministic
mechanisms. The main design problem considered in our work is to come up with a mechanism (e.g., an
$f$ or randomization over functions $f$) to maximize our (expected) profit.
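One simple function $f$ of this form picks, for each agent, the offer that earns the most from everyone else. The sketch below uses it purely to illustrate the definition; it is not the mechanism analyzed in this chapter. The names `payoff_profile`, `leave_one_out`, and `run_mechanism` are hypothetical, and the offers are two take-it-or-leave-it prices chosen for the example.

```python
def payoff_profile(valuation, prices):
    """Payoff profile [g(i)] over all offers: here each offer is a price."""
    return {p: (p if valuation >= p else 0) for p in prices}


def leave_one_out(i, other_profiles):
    """A function f(i, profiles of the others): the offer earning the most from the other agents."""
    offers = sorted(next(iter(other_profiles.values())))
    return max(offers, key=lambda g: sum(profile[g] for profile in other_profiles.values()))


def run_mechanism(f, profiles):
    """Apply f to every agent and add up the profit g_i(i)."""
    chosen, profit = {}, 0
    for i in profiles:
        others = {j: prof for j, prof in profiles.items() if j != i}
        chosen[i] = f(i, others)            # never looks at agent i's own profile
        profit += profiles[i][chosen[i]]
    return chosen, profit


prices = [2, 5]
valuations = {1: 5, 2: 5, 3: 5, 4: 2}
profiles = {i: payoff_profile(v, prices) for i, v in valuations.items()}
chosen, profit = run_mechanism(leave_one_out, profiles)

# Changing agent 1's report cannot change the offer made to agent 1.
altered = dict(profiles)
altered[1] = payoff_profile(1, prices)
chosen_after_misreport, _ = run_mechanism(leave_one_out, altered)
```

Every agent is offered the price 5, agents 1 to 3 accept and agent 4 declines, so the profit is 15, which in this example equals the best single-price profit.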

Our  approach  is  through  a  reduction  from  the  mechanism  design  problem  to  the  algorithm  design 
problem  that  is  applicable  at  this  level  of  generality  (both  design  and  analysis),  though  tighter  analysis  is 
possible  when  we  expose  more  structure  in  the  agent  preferences  and  class  of  offers  (as  described  next). 
Our bounds make use of a parameter $h$ which upper bounds the value of $g(i)$ for all $i \in S$ and $g \in \mathcal{G}$;
that is, no individual agent can influence the total profit by more than $h$. The auctions we describe that
make  use  of  the  technique  of  structural  risk  minimization  will  need  to  know  h  in  advance. 




7.2.2  Offers,  Preferences,  and  Incentives 

To  describe  how  the  framework  above  allows  us  to  consider  a  large  class  of  mechanism  design  problems, 
we  formally  discuss  the  details  of  offers,  agent  preferences,  and  the  constraints  imposed  by  incentive 
compatibility.  To  do  this  we  develop  some  notation;  however,  the  main  results  in  our  work  will  be  given 
using  the  general  framework  above. 

Formally, a market consists of a set of $n$ agents, $S$, and a space of possible outcomes, $\mathcal{O}$. We consider
unlimited supply allocation problems where $\mathcal{O}_i$ is the set of possible outcomes (allocations) to agent $i$ and
$\mathcal{O} = \mathcal{O}_1 \times \cdots \times \mathcal{O}_n$ (i.e., all possible combinations of allocations are feasible). Except where noted, we
assume there is no cost to the mechanism for producing any outcome.

As is standard in the mechanism design literature [177], an agent $i$’s preference is fully specified by
its private type, which we denote $v_i$. We assume no externalities, which means that $v_i$ can be viewed as a
preference ordering, $\succeq_{v_i}$, over (outcome, payment) pairs in $\mathcal{O}_i \times \mathbb{R}$. That is, each agent cares only about
what outcome it receives and what it pays, and not about what other agents get. A bid, $b_i$, is a reporting of one’s
type, i.e., it is also a preference ordering over (outcome, payment) pairs, and we say a bidder is bidding
truthfully if the preference ordering under $b_i$ matches that given by its true type, $v_i$.

A deterministic mechanism is incentive compatible if for all agents $i$ and all actions of the other
agents, bidding truthfully is at least as good as bidding non-truthfully. If $o_i(b_i, b_{-i})$ and $p_i(b_i, b_{-i})$ are
the outcome and payment when agent $i$ bids $b_i$ and the other agents bid $b_{-i}$, then incentive compatibility
requires for all $v_i$, $b_i$, and $b_{-i}$,

$$(o_i(v_i, b_{-i}),\, p_i(v_i, b_{-i})) \succeq_{v_i} (o_i(b_i, b_{-i}),\, p_i(b_i, b_{-i})).$$

A  randomized  mechanism  is  incentive  compatible  if  it  is  a  randomization  over  deterministic  incentive 
compatible  mechanisms. 

An offer, as described abstractly in the preceding section, need not be anonymous. This allows the
freedom to charge different agents different prices for the same outcome. In particular, for a fixed offer
$g$, the payment to two agents, $g(i)$ and $g(i')$, may be different even if $b_i = b_{i'}$. We consider a structured
approach to this sort of discriminatory pricing by associating to each agent $i$ some publicly observable
attribute value $\mathrm{pub}_i$. An offer then is a mapping from a bidder's public information to a collection of
(outcome, payment) pairs which the agent's preference ranks. We interpret making an offer to an agent
as choosing the outcome and payment that they most prefer according to their reported preference. For
an incentive compatible mechanism, where we can assume that $v_i = b_i$, $g(i)$ is the payment component
of this (outcome, payment) pair. Clearly, the mechanism that always makes every agent a fixed offer is
by definition incentive compatible. In fact the following more general result, which motivates the above
definition of an abstract mechanism, is easy to show:

Fact 7.2.1 A mechanism is incentive compatible if the choice of which offer to make to any agent does not
depend on the agent's reported preference.

Because all our mechanisms are incentive compatible, the established notation of $g(i)$ as the profit of
offer $g$ on agent $i$ will be sufficient for most discussions and we will omit explicit reference to $v_i$ and $b_i$
where possible.

7.2.3  Quasi-linear  Preferences 

We will apply our general framework and analysis to a number of special cases where the agents'
preferences are to maximize their quasi-linear utility. This is the most studied case in the mechanism design
literature. The type, $v_i$, of a quasi-linear utility maximizing agent $i$ specifies its valuation for each
outcome. We denote the valuation of agent $i$ for outcome $o_i \in O_i$ as $v_i(o_i)$. This agent's utility is the
difference between its valuation and the price it is required to pay. I.e., for outcome $o_i$ and payment $p_i$,
agent $i$'s utility is $u_i = v_i(o_i) - p_i$. An agent prefers the outcome and payment that maximizes its utility.
I.e., $v_i(o_i) - p_i \ge v_i(o_i') - p_i'$ if and only if $(o_i, p_i) \succeq_{v_i} (o_i', p_i')$.

For the quasi-linear case, the incentive compatibility constraints imply for all $v_i$, $b_i$, and $b_{-i}$ that

$$v_i(o_i(v_i, b_{-i})) - p_i(v_i, b_{-i}) \ge v_i(o_i(b_i, b_{-i})) - p_i(b_i, b_{-i}).$$

Notice that in the quasi-linear setting our constraint that $g(i) \le h$ would be implied by the condition
that $v_i(o_i) \le h$ for all $o_i \in O_i$.


7.2.4  Examples 


The following examples illustrate the relationship between the outcome of the mechanism, offers,
valuations, and attributes. (The first three examples are quasi-linear, the fourth is not.)


Digital Good Auction: The digital good auction models an auction of a single item in unlimited supply
to indistinguishable bidders. Here the set of possible outcomes for bidder $i$ is $O_i = \{0, 1\}$ where
$o_i = 1$ represents bidder $i$ receiving a copy of the good and $o_i = 0$ otherwise. We normalize their
valuation function to $v_i(0) = 0$ and use the simple shorthand notation $v_i = v_i(1)$ for the bidder's
privately known valuation for receiving the good. As described in the introduction, in this setting
the bidders have no public information. Here, a natural class of offers, $\mathcal{G}$, is the class of all
take-it-or-leave-it prices. For bidder $i$ with valuation $v_i$ and offer $g_p$ = "take the good for \$$p$, or leave it"
the profit is

$$g_p(i) = \begin{cases} p & \text{if } p \le v_i, \\ 0 & \text{otherwise.} \end{cases}$$

We consider the digital good auction problem in detail in Section 7.4.
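As a small illustration of the profit function of a take-it-or-leave-it offer, the following sketch evaluates $g_p(i)$ and the total profit $g_p(S)$ on a list of valuations. The function names and the sample valuations are hypothetical and are not part of the original text.

```python
from typing import Sequence


def offer_profit(price: float, valuation: float) -> float:
    """Profit g_p(i): the bidder pays `price` if and only if price <= valuation."""
    return price if price <= valuation else 0.0


def total_profit(price: float, valuations: Sequence[float]) -> float:
    """Profit g_p(S): the sum of g_p(i) over all bidders in S."""
    return sum(offer_profit(price, v) for v in valuations)


if __name__ == "__main__":
    valuations = [1.0, 2.0, 2.0, 5.0]  # hypothetical truthful bids
    for price in (1.0, 2.0, 5.0):
        print(price, total_profit(price, valuations))
```

On this sample, the price 2 is accepted by three bidders, while the prices 1 and 5 are accepted by four bidders and one bidder respectively.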

Attribute Auctions: This is the same as the digital good setting except now each bidder $i$ is associated
with a public attribute, $\mathrm{pub}_i \in X$, where $X$ is the attribute space. We view $X$ as an abstract space, but
one can envision it as $\mathbb{R}^d$, for example. Let $\mathcal{P}$ be a class of pricing functions from $X$ to $\mathbb{R}^+$, such
as all linear functions, or all functions that partition $X$ into $k$ markets in some natural way (say,
based on distance to $k$ cluster centers) and offer a different price in each. Let $\mathcal{G}$ be the class of
take-it-or-leave-it offers induced by $\mathcal{P}$. That is, if $p \in \mathcal{P}$ is a pricing function, then the offer $g_p \in \mathcal{G}$
induced by $p$ is: "for bidder $i$, take the good for \$$p(\mathrm{pub}_i)$, or leave it". The profit to the mechanism
from bidder $i$ with valuation $v_i$ and public information $\mathrm{pub}_i$ is

$$g_p(i) = \begin{cases} p(\mathrm{pub}_i) & \text{if } p(\mathrm{pub}_i) \le v_i, \\ 0 & \text{otherwise.} \end{cases}$$

We will give analyses for several interesting classes of pricing functions in Section 7.5.

Combinatorial Auctions: Here we have a set $J$ of $m$ distinct items, each in unlimited supply. Each
consumer has a private valuation $v_i(J')$ for each bundle $J' \subseteq J$ of items, which measures how much
receiving bundle $J'$ would be worth to the consumer $i$ (again we normalize such that $v_i(\emptyset) = 0$). For
simplicity, we assume bidders are indistinguishable, i.e., there is no public information. A natural
class of offers $\mathcal{G}$ (studied in [129]) is the class of functions that assign a separate price to each item,
such that the price of a bundle is just the sum of the prices of the items in it (called item pricing).
For price vector $p = (p_1, \ldots, p_m)$ let the offer $g_p$ = "for bundle $J'$, pay $\sum_{j \in J'} p_j$". The profit for
bidder $i$ on offer $g_p$ is

$$g_p(i) = \sum_{j \in J'} p_j, \quad \text{where } J' = \arg\max_{J'' \subseteq J} \Big( v_i(J'') - \sum_{j \in J''} p_j \Big).$$

(If the bundle $J'$ maximizing the bidder's utility is not unique, we define the mechanism to select
the utility-maximizing bundle of greatest profit.) We discuss combinatorial auctions in Section 7.6.
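The item-pricing profit can be sketched by brute force over bundles, which is only practical for a handful of items. The function name, the dictionary-based valuation, and the sample numbers below are hypothetical. Ties in utility are broken toward the larger payment, following the tie-breaking rule stated above.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence


def item_pricing_profit(
    prices: Sequence[float],
    valuation: Callable[[FrozenSet[int]], float],
) -> float:
    """Payment of one bidder facing item prices `prices`.

    The bidder picks a bundle maximizing valuation minus payment; among
    utility-maximizing bundles the one with the greatest payment is selected.
    Enumerates all 2^m bundles, so it is meant for small m only.
    """
    best_utility, best_payment = 0.0, 0.0  # empty bundle: v(empty) = 0, pays 0
    for size in range(1, len(prices) + 1):
        for bundle in combinations(range(len(prices)), size):
            payment = sum(prices[j] for j in bundle)
            utility = valuation(frozenset(bundle)) - payment
            better = utility > best_utility
            tie = utility == best_utility and payment > best_payment
            if better or tie:
                best_utility, best_payment = utility, payment
    return best_payment


if __name__ == "__main__":
    # Hypothetical valuation over two items, given as a lookup table.
    table = {
        frozenset(): 0.0,
        frozenset({0}): 2.0,
        frozenset({1}): 1.0,
        frozenset({0, 1}): 4.0,
    }
    print(item_pricing_profit([1.0, 1.0], table.__getitem__))
```

The exact floating-point comparison used for ties is adequate for this sketch; a real implementation would likely compare with a tolerance.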

Marginal Cost Auctions with Budgets: To illustrate an interesting model with agents in a non-quasi-linear
setting, consider the case where each bidder $i$'s preference is given by a tuple $(B_i, v_i)$ where $B_i$ is their
budget and $v_i$ is their value-per-unit received. Possible allocations for bidder $i$, $O_i$, are non-negative
real numbers corresponding to the number of units they receive. Assuming their total payment is
less than their budget, bidder $i$'s utility is simply $v_i o_i$ minus their payment; a bidder's utility when
payments exceed their budget is negative infinity.

We assume that the seller has a fixed marginal cost $c$ for producing a unit of the good. Consider the
class of offers $\mathcal{G}$ with $g_p$ = "pay \$$p$ per unit received". A bidder $i$ faced with offer $g_p$ with $p \le v_i$
will maximize their utility by buying enough units to exactly exhaust their budget. The payoff to
the auctioneer for this bidder $i$ is therefore $B_i$ less $c$ times the number of units the bidder demands.
I.e.,

$$g_p(i) = \begin{cases} B_i - c B_i / p & \text{if } p \le v_i, \\ 0 & \text{otherwise.} \end{cases}$$

This model is quite similar to one considered by Borgs et al. [70]. Though we do not explicitly
analyze this setting, it is simple to apply our generic analysis to get reasonable bounds.
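The payoff of a per-unit price offer in the budgeted setting can be written out directly. This is a minimal sketch of the formula above, assuming a strictly positive price; the function and parameter names are hypothetical.

```python
def marginal_cost_profit(
    price: float, budget: float, value_per_unit: float, unit_cost: float
) -> float:
    """Seller payoff g_p(i) for the offer "pay `price` per unit received".

    If price <= value_per_unit the bidder spends the whole budget, buying
    budget / price units, and the seller pays `unit_cost` for each unit.
    Otherwise the bidder buys nothing. Assumes price > 0.
    """
    if price <= value_per_unit:
        units = budget / price
        return budget - unit_cost * units
    return 0.0


if __name__ == "__main__":
    # Hypothetical bidder with budget 10 and value 3 per unit; unit cost 1.
    print(marginal_cost_profit(2.0, 10.0, 3.0, 1.0))
```

Note that the payoff is negative whenever the price is below the marginal cost, so a sensible offer class would restrict attention to prices of at least $c$.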

7.3  Generic  Reductions 

We are interested in reducing incentive-compatible mechanism design to the (non-incentive-compatible)
algorithmic optimization problem. Our reductions will be based on random sampling. Let $A$ be an
algorithm (exact or approximate) for the algorithmic optimization problem over $\mathcal{G}$. The simplest mechanism
that we consider, which we call $\mathrm{RSO}_{(\mathcal{G},A)}$ (Random Sampling Optimal offer), is the following
generalization of the random sampling digital-goods auction from [121]:

0.  Bidders  commit  to  their  preferences  by  submitting  their  bids. 

1.  Randomly  split  the  bidders  into  two  groups  Si  and  S2  by  flipping  a  fair  coin  for  each  bidder  to 
determine  its  group. 

2. Run $A$ to determine the best (or approximately best) offer $g_1 \in \mathcal{G}$ over $S_1$, and similarly the best
(or approximately best) $g_2 \in \mathcal{G}$ over $S_2$.

3. Finally, apply $g_1$ to all bidders in $S_2$ and $g_2$ to all bidders in $S_1$ using their reported bids.

We will also consider various more refined versions of $\mathrm{RSO}_{(\mathcal{G},A)}$ that discretize $\mathcal{G}$ or perform some type of
structural risk minimization (in which case we will need to assume $A$ can optimize over the modifications
made to $\mathcal{G}$).
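The steps of the random sampling mechanism can be sketched as follows for bids that are single real numbers. The optimizer `best_fixed_price` is a hypothetical stand-in for the algorithm $A$ in the digital good setting, and all names are illustrative assumptions rather than part of the original text.

```python
import random
from typing import Callable, List, Optional, Sequence

Offer = Callable[[float], float]  # maps a bid to the profit the offer extracts


def fixed_price_offer(price: float) -> Offer:
    """Take-it-or-leave-it offer: the bidder pays `price` iff bid >= price."""
    return lambda bid: price if price <= bid else 0.0


def best_fixed_price(bids: Sequence[float]) -> Offer:
    """Hypothetical exact optimizer A for fixed-price offers on a sample."""
    if not bids:
        return fixed_price_offer(float("inf"))  # empty sample: nobody accepts
    best = max(set(bids), key=lambda p: sum(p for b in bids if b >= p))
    return fixed_price_offer(best)


def rso(
    bids: Sequence[float],
    optimize: Callable[[Sequence[float]], Offer],
    rng: Optional[random.Random] = None,
) -> float:
    """Run the random sampling mechanism once and return its profit."""
    rng = rng or random.Random()
    s1: List[float] = []
    s2: List[float] = []
    for bid in bids:  # step 1: one fair coin flip per bidder
        (s1 if rng.random() < 0.5 else s2).append(bid)
    g1, g2 = optimize(s1), optimize(s2)  # step 2: optimize on each half
    # Step 3: each half is charged according to the offer learned on the other.
    return sum(g1(b) for b in s2) + sum(g2(b) for b in s1)


if __name__ == "__main__":
    print(rso([1.0, 2.0, 2.0, 5.0, 5.0, 5.0], best_fixed_price, random.Random(1)))
```

The offer made to a bidder depends only on the bids in the other half, which is the property that Fact 7.2.1 requires for incentive compatibility.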

Note 1: One might think that the "leave-one-out" mechanism, where the offer made to a given bidder $i$
is the best offer for all other bidders, i.e., $\mathrm{opt}_{\mathcal{G}}(S \setminus \{i\})$, would be a better mechanism than the random
sampling mechanism above. However, as pointed out in [121, 124], such a mechanism (and indeed,
any symmetric deterministic mechanism) has poor worst-case revenue. Furthermore, even if bidders'
valuations are independently drawn from some distribution, the leave-one-out revenue can be much less
stable than $\mathrm{RSO}_{(\mathcal{G},A)}$ in that it may have a non-negligible probability of achieving revenue that is far from
optimal, whereas the probability of such an event is exponentially small for $\mathrm{RSO}_{(\mathcal{G},A)}$.¹

Note 2: The reader will notice that in converting an algorithm for finding the best offer in $\mathcal{G}$ into an
incentive-compatible mechanism, we produce a mechanism whose outcome is not simply that of a single
offer applied to all consumers. For example, even in the simplest case of auctioning a digital good to
indistinguishable bidders, we compare our performance to the best take-it-or-leave-it price, and yet the
auction itself does not in fact offer each bidder the same price (all bidders in $S_1$ get the same price, and
all bidders in $S_2$ get the same price, but those two prices may be different). In fact, Goldberg and Hartline
[120] show that this sort of behavior is necessary: it is not possible for an incentive-compatible auction to
approximately maximize profit and offer all the bidders the same price.

7.3.1  Generic  Analyses 

The  following  theorem  shows  that  the  random  sampling  auction  incurs  only  a  small  loss  in  performance 
if  the  profit  of  the  optimal  offer  is  large  in  comparison  to  the  logarithm  of  the  number  of  offers  we  are 
choosing from. Later sections of this chapter will focus on techniques for bounding the effective size (or
complexity) of $\mathcal{G}$ that can yield even stronger guarantees.

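Theorem 7.3.1 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
with probability at least $1 - \delta$ the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ is at least $(1 - \epsilon)\mathrm{OPT}_{\mathcal{G}}/\beta$ as long as

$$\mathrm{OPT}_{\mathcal{G}} \ge \beta \frac{18h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right).$$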

Notice that this bound holds for all $\epsilon$ and $\delta$ simultaneously as these are not parameters of the
mechanism. In particular, this bound and those given by the two immediate corollaries, below, show how the
approximation factor improves as a function of market size.

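Corollary 7.3.2 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
with probability at least $1 - \delta$, the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ is at least $(1 - \epsilon)\mathrm{OPT}_{\mathcal{G}}/\beta$, when $\mathrm{OPT}_{\mathcal{G}} \ge n$ and
the number of bidders $n$ satisfies

$$n \ge \frac{18 h \beta}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right).$$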

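Corollary 7.3.3 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
with probability at least $1 - \delta$, the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ is at least

$$(1 - \epsilon)\mathrm{OPT}_{\mathcal{G}}/\beta - \frac{18h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right).$$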

If bidders' valuations are in the interval $[1, h]$ and the take-it-or-leave-it offer of \$1 is in $\mathcal{G}$, then the
condition $\mathrm{OPT}_{\mathcal{G}} \ge n$ is trivially satisfied and Corollary 7.3.2 can be interpreted as giving a bound on the
convergence rate of the random sampling auction. Corollary 7.3.3 is a useful form of our bound when
considering structural risk minimization and it also matches the form of bounds given in prior work (e.g.,
[60]).

¹For example, say we are selling just one item and the distribution over valuations is 50% probability of valuation 1 and 50%
probability of valuation 2. If we have $n$ bidders, then there is a nontrivial chance (about $1/\sqrt{n}$) that there will be the exact same
number of each type ($n/2$ bidders with valuation 1 and $n/2$ bidders with valuation 2), and the leave-one-out mechanism will make
the wrong decision on everybody. The $\mathrm{RSO}_{(\mathcal{G},A)}$ mechanism on the other hand has only an exponentially small probability of
doing this poorly.

For example, in the digital good auction with the class of offers $\mathcal{G}_\epsilon$ consisting of all take-it-or-leave-it
offers in the interval $[1, h]$ discretized to powers of $1 + \epsilon$, we have $\mathrm{OPT}_{\mathcal{G}_\epsilon} \ge n$ (since each bidder's
valuation is at least 1), $\beta = 1$ (since the algorithmic problem is easy), and $|\mathcal{G}_\epsilon| = \lceil \log_{1+\epsilon} h \rceil$. So,
Corollary 7.3.2 states that $O\left(\frac{h}{\epsilon^2} \log\log_{1+\epsilon} h\right)$ bidders are sufficient to perform nearly as well as optimal
(we derive better bounds for this problem in Section 7.4).
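To get a feel for the magnitudes involved, the following sketch evaluates the number of discretized prices and the resulting threshold on the number of bidders. It assumes that the threshold of Theorem 7.3.1 reads $\beta \frac{18h}{\epsilon^2} \ln(2|\mathcal{G}|/\delta)$; the function names and the sample parameter values are hypothetical.

```python
import math


def discretized_class_size(h: float, eps: float) -> int:
    """Number of prices in [1, h] that are powers of (1 + eps): ceil(log_{1+eps} h)."""
    return math.ceil(math.log(h) / math.log(1.0 + eps))


def rso_threshold(
    h: float, eps: float, delta: float, num_offers: int, beta: float = 1.0
) -> float:
    """Assumed threshold beta * (18 h / eps^2) * ln(2 |G| / delta)."""
    return beta * 18.0 * h / eps**2 * math.log(2.0 * num_offers / delta)


if __name__ == "__main__":
    h, eps, delta = 100.0, 0.1, 0.05  # hypothetical parameters
    size = discretized_class_size(h, eps)
    print(size, rso_threshold(h, eps, delta, size))
```

The threshold grows linearly in $h$ but only doubly logarithmically in $h$ through the class size, which is the $O\left(\frac{h}{\epsilon^2}\log\log_{1+\epsilon} h\right)$ behavior described above.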

In general we will give our bounds in a similar form as Theorem 7.3.1, knowing that bounds of
the form of Corollaries 7.3.2 and 7.3.3 can be easily derived. The only exceptions are the structural risk
minimization results which we give in the same form as Corollary 7.3.3.

In the remainder of this section we prove Theorem 7.3.1. We start with a lemma that is key to our
analysis.

Lemma 7.3.4 Given $S$, an offer $g$ satisfying $0 \le g(i) \le h$ for all $i \in S$, and a profit level $p$, if we
randomly partition $S$ into $S_1$ and $S_2$, then the probability that $|g(S_1) - g(S_2)| \ge \epsilon \max[g(S), p]$ is at
most $2e^{-\epsilon^2 p/(2h)}$.


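Proof: Let $Y_1, \ldots, Y_n$ be i.i.d. random variables that define the partition of $S$ into $S_1$ and $S_2$: that is, $Y_i$
is 1 with probability $\frac{1}{2}$ and is 2 with probability $\frac{1}{2}$. Let $t(Y_1, \ldots, Y_n) = \sum_{i : Y_i = 1} g(i)$. So, as a random
variable, $g(S_1) = t(Y_1, \ldots, Y_n)$ and clearly $E[t(Y_1, \ldots, Y_n)] = \frac{g(S)}{2}$. Assume first that $g(S) \ge p$. From the
McDiarmid concentration inequality (see Theorem A.3.1 in Appendix A.3), by plugging in $c_i = g(i)$, we
get:

$$\Pr\left\{ \left| g(S_1) - \frac{g(S)}{2} \right| \ge \frac{\epsilon}{2} g(S) \right\} \le 2 e^{-\frac{1}{2} \epsilon^2 g(S)^2 / \sum_{i=1}^{n} g(i)^2}.$$

Since

$$\sum_{i=1}^{n} g(i)^2 \le \max_i \{g(i)\} \sum_{i=1}^{n} g(i) \le h\, g(S),$$

we obtain:

$$\Pr\left\{ \left| g(S_1) - \frac{g(S)}{2} \right| \ge \frac{\epsilon}{2} g(S) \right\} \le 2 e^{-\epsilon^2 g(S)/(2h)}.$$

Moreover, since $g(S_1) + g(S_2) = g(S)$ and $g(S) \ge p$, we obtain:

$$\Pr\{ |g(S_1) - g(S_2)| \ge \epsilon g(S) \} \le 2 e^{-\epsilon^2 p/(2h)},$$

as desired. Consider now the case that $g(S) < p$. Again, using the McDiarmid inequality we have

$$\Pr\{ |g(S_1) - g(S_2)| \ge \epsilon p \} \le 2 e^{-\frac{1}{2} \epsilon^2 p^2 / \sum_{i=1}^{n} g(i)^2}.$$

Since $\sum_{i=1}^{n} g(i)^2 \le h\, g(S) \le ph$ we obtain again that

$$\Pr\{ |g(S_1) - g(S_2)| \ge \epsilon p \} \le 2 e^{-\epsilon^2 p/(2h)},$$

which gives us the desired bound. ■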
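The lemma can be checked empirically with a small Monte Carlo experiment: repeatedly split the bidders with fair coins and count how often the two halves differ by at least $\epsilon \max[g(S), p]$. The sketch below assumes the bound $2e^{-\epsilon^2 p/(2h)}$ stated in Lemma 7.3.4; the function names and the test instance are hypothetical.

```python
import math
import random
from typing import Sequence


def split_deviation_rate(
    profits: Sequence[float], eps: float, p: float, trials: int = 2000, seed: int = 0
) -> float:
    """Empirical frequency of |g(S1) - g(S2)| >= eps * max(g(S), p).

    `profits` lists g(i) for every bidder i in S; each trial assigns every
    bidder to S1 or S2 with an independent fair coin flip.
    """
    rng = random.Random(seed)
    total = sum(profits)
    threshold = eps * max(total, p)
    bad = 0
    for _ in range(trials):
        g_s1 = sum(x for x in profits if rng.random() < 0.5)
        if abs(g_s1 - (total - g_s1)) >= threshold:
            bad += 1
    return bad / trials


def lemma_bound(eps: float, p: float, h: float) -> float:
    """Upper bound 2 * exp(-eps^2 * p / (2 h)) from the lemma."""
    return 2.0 * math.exp(-(eps**2) * p / (2.0 * h))


if __name__ == "__main__":
    profits = [1.0] * 400  # hypothetical offer with g(i) = 1 for all i, so h = 1
    print(split_deviation_rate(profits, eps=0.2, p=100.0), lemma_bound(0.2, 100.0, 1.0))
```

On instances like this one the empirical frequency is typically far below the bound, since the lemma uses only $p$ rather than the actual value of $g(S)$ in the exponent.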

It is worth noting that using tail inequalities that depend on the maximum range of the random
variables rather than the sum of their squares in the proof of Lemma 7.3.4 would increase the $h$ to an $h^2$ in the
exponent. Note also that if $g(i) = g'(i)$ for all $i \in S$ then $g$ and $g'$ are equivalent from the point of view of the
auction; we will use $|\mathcal{G}|$ to denote the number of different such offers in $\mathcal{G}$.²

²Notice that in our generic reduction, $|\mathcal{G}|$ only appears in the analysis and we do not actually have to know whether two offers
are equivalent with respect to $S$ when running the auction.

Lemma 7.3.4 implies that:




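Corollary 7.3.5 For a random partition of $S$ into $S_1$ and $S_2$, with probability at least $1 - \delta$, all offers $g$
in $\mathcal{G}$ such that $g(S) \ge \frac{2h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right)$ satisfy $|g(S_1) - g(S_2)| \le \epsilon g(S)$.

Proof: Follows from Lemma 7.3.4 by plugging in $p = \frac{2h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right)$ and then using the union bound over
all $g \in \mathcal{G}$. ■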

We  complete  this  section  with  the  proof  of  the  main  theorem. 

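Proof of Theorem 7.3.1: Let $g_1$ be the offer in $\mathcal{G}$ produced by $A$ over $S_1$ and $g_2$ be the offer in $\mathcal{G}$ produced
by $A$ over $S_2$. Let $g_{\mathrm{OPT}}$ be the optimal offer in $\mathcal{G}$ over $S$; so $g_{\mathrm{OPT}}(S) = \mathrm{OPT}_{\mathcal{G}}$. Since the optimal
offer over $S_1$ is at least as good as $g_{\mathrm{OPT}}$ on $S_1$ (and likewise for $S_2$), the fact that $A$ is a $\beta$-approximation
implies that $g_1(S_1) \ge \frac{g_{\mathrm{OPT}}(S_1)}{\beta}$ and $g_2(S_2) \ge \frac{g_{\mathrm{OPT}}(S_2)}{\beta}$.

Let $p = \frac{18h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}|}{\delta}\right)$. Using Lemma 7.3.4 (applying the union bound over all $g \in \mathcal{G}$), we have
that with probability $1 - \delta$, every $g \in \mathcal{G}$ satisfies $|g(S_1) - g(S_2)| \le \frac{\epsilon}{3} \max[g(S), p]$. In particular,
$g_1(S_2) \ge g_1(S_1) - \frac{\epsilon}{3} \max[g_1(S), p]$, and $g_2(S_1) \ge g_2(S_2) - \frac{\epsilon}{3} \max[g_2(S), p]$.

Since the theorem assumes that $\mathrm{OPT}_{\mathcal{G}} \ge \beta p$, summing the above two inequalities and performing a
case analysis³ we get that the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$, namely the sum $g_1(S_2) + g_2(S_1)$, is at least $(1 - \epsilon)\frac{\mathrm{OPT}_{\mathcal{G}}}{\beta}$.
More specifically, assume first that $g_1(S) \ge p$ and $g_2(S) \ge p$. This implies that

$$g_1(S_2) \ge g_1(S_1) - \frac{\epsilon}{3} g_1(S) \quad \text{and} \quad g_2(S_1) \ge g_2(S_2) - \frac{\epsilon}{3} g_2(S),$$

and therefore, since $g_j(S) = g_j(S_1) + g_j(S_2)$ for $j = 1, 2$,

$$\left(1 + \frac{\epsilon}{3}\right) g_1(S_2) \ge \left(1 - \frac{\epsilon}{3}\right) g_1(S_1) \quad \text{and} \quad \left(1 + \frac{\epsilon}{3}\right) g_2(S_1) \ge \left(1 - \frac{\epsilon}{3}\right) g_2(S_2).$$

So, the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ in this case is at least

$$\frac{1 - \frac{\epsilon}{3}}{1 + \frac{\epsilon}{3}} \left( g_1(S_1) + g_2(S_2) \right) \ge \frac{1 - \frac{\epsilon}{3}}{1 + \frac{\epsilon}{3}} \cdot \frac{\mathrm{OPT}_{\mathcal{G}}}{\beta} \ge (1 - \epsilon) \frac{\mathrm{OPT}_{\mathcal{G}}}{\beta}.$$

If both $g_1(S) < p$ and $g_2(S) < p$, then $g_1(S_2) \ge g_1(S_1) - \frac{\epsilon}{3} p$ and $g_2(S_1) \ge g_2(S_2) - \frac{\epsilon}{3} p$, and so the
profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ in this case is at least $\frac{\mathrm{OPT}_{\mathcal{G}}}{\beta} - \frac{2\epsilon}{3} p$, which is at least $(1 - \epsilon)\frac{\mathrm{OPT}_{\mathcal{G}}}{\beta}$ by our assumption
that $\mathrm{OPT}_{\mathcal{G}} \ge \beta p$.

Finally, assume without loss of generality that $g_1(S) \ge p$ and $g_2(S) < p$. This implies that

$$g_1(S_2) \ge g_1(S_1) - \frac{\epsilon}{3} g_1(S) \quad \text{and} \quad g_2(S_1) \ge g_2(S_2) - \frac{\epsilon}{3} p.$$

The former inequality implies that $\left(1 + \frac{\epsilon}{3}\right) g_1(S_2) \ge \left(1 - \frac{\epsilon}{3}\right) g_1(S_1)$, and so $g_1(S_2) \ge \left(1 - \frac{2\epsilon}{3}\right) g_1(S_1)$,
and the latter inequality implies that $g_2(S_1) \ge g_2(S_2) - \frac{\epsilon}{3} \cdot \frac{\mathrm{OPT}_{\mathcal{G}}}{\beta}$. Together we have that

$$g_1(S_2) + g_2(S_1) \ge \left(1 - \frac{2\epsilon}{3}\right) \left[ \frac{g_{\mathrm{OPT}}(S_1)}{\beta} + \frac{g_{\mathrm{OPT}}(S_2)}{\beta} \right] - \frac{\epsilon}{3} \cdot \frac{\mathrm{OPT}_{\mathcal{G}}}{\beta} \ge (1 - \epsilon) \frac{\mathrm{OPT}_{\mathcal{G}}}{\beta},$$

as desired. ■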


³Note that if $\beta = 1$, then the conclusion follows easily. The case analysis is only needed to deal with the case $\beta > 1$.




7.3.2  Structural  Risk  Minimization 

In many natural cases, $\mathcal{G}$ consists of offers at different "levels of complexity" $k$. In the case of attribute
auctions, for instance, $\mathcal{G}$ could be an offer class induced by pricing functions that partition bidders into
$k$ markets and offer a constant price in each market, for different values of $k$. The larger $k$ is, the more
complex the offer is. One natural approach to such a setting is to perform structural risk minimization
(SRM): that is, to assign a penalty term to offers based on their complexity and then to run a version
of $\mathrm{RSO}_{(\mathcal{G},A)}$ in which $A$ optimizes profit minus penalty. Specifically, let $\mathcal{G}$ be a series of offer classes
$\mathcal{G}_1, \mathcal{G}_2, \ldots$ and let pen be a penalty function defined over these classes. We then define the procedure
$\mathrm{RSO\text{-}SRM}_{(\mathcal{G},\mathrm{pen})}$ as follows:

1. Randomly partition the bidders into two sets, $S_1$ and $S_2$, by flipping a fair coin for each bidder.

2. Compute $g_1$ to maximize $\max_k \max_{g \in \mathcal{G}_k} [g(S_1) - \mathrm{pen}(\mathcal{G}_k)]$ and similarly compute $g_2$ from $S_2$.

3. Use the offer $g_1$ for bidders in $S_2$ and the offer $g_2$ for bidders in $S_1$.
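The three steps above can be sketched for finite offer classes by exhaustive search over every class and every offer in it. Representing the classes as explicit lists, and all names used here, are illustrative assumptions; the procedure in the text only requires some algorithm that solves the penalized optimization problem.

```python
import random
from typing import Callable, List, Optional, Sequence

Offer = Callable[[float], float]  # maps a bid to the profit the offer extracts


def fixed_price_offer(price: float) -> Offer:
    """Take-it-or-leave-it offer: the bidder pays `price` iff bid >= price."""
    return lambda bid: price if price <= bid else 0.0


def best_penalized_offer(
    sample: Sequence[float],
    classes: Sequence[Sequence[Offer]],
    penalties: Sequence[float],
) -> Offer:
    """Step 2: maximize g(sample) - pen(G_k) over all k and all g in G_k."""
    best_offer: Optional[Offer] = None
    best_score = float("-inf")
    for offers, penalty in zip(classes, penalties):
        for g in offers:
            score = sum(g(b) for b in sample) - penalty
            if score > best_score:
                best_offer, best_score = g, score
    if best_offer is None:
        raise ValueError("at least one non-empty offer class is required")
    return best_offer


def rso_srm(
    bids: Sequence[float],
    classes: Sequence[Sequence[Offer]],
    penalties: Sequence[float],
    rng: Optional[random.Random] = None,
) -> float:
    """Run the penalized random sampling procedure once; return its profit."""
    rng = rng or random.Random()
    s1: List[float] = []
    s2: List[float] = []
    for bid in bids:  # step 1: one fair coin flip per bidder
        (s1 if rng.random() < 0.5 else s2).append(bid)
    g1 = best_penalized_offer(s1, classes, penalties)
    g2 = best_penalized_offer(s2, classes, penalties)
    # Step 3: each half faces the offer selected on the other half.
    return sum(g1(b) for b in s2) + sum(g2(b) for b in s1)


if __name__ == "__main__":
    low, high = fixed_price_offer(1.0), fixed_price_offer(2.0)
    print(rso_srm([2.0] * 20, [[low], [high]], [0.0, 5.0], random.Random(0)))
```

The penalty only affects which offer is selected in step 2; the profit collected in step 3 is the unpenalized profit of the selected offers.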

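We can now derive a guarantee for the $\mathrm{RSO\text{-}SRM}_{(\mathcal{G},\mathrm{pen})}$ mechanism as follows:

Theorem 7.3.6 Assuming that we have an algorithm for solving the optimization problem required by
$\mathrm{RSO\text{-}SRM}_{(\mathcal{G},\mathrm{pen})}$, then for any given value of $n$, $\epsilon$, and $\delta$, with probability at least $1 - \delta$, the revenue of
$\mathrm{RSO\text{-}SRM}_{(\mathcal{G},\mathrm{pen})}$ for $\mathrm{pen}(\mathcal{G}_k) = \frac{8 h_k}{\epsilon^2} \ln\left(\frac{8 k^2 |\mathcal{G}_k|}{\delta}\right)$ is at least

$$\max_k \left( (1 - \epsilon)\, \mathrm{OPT}_k - 2\, \mathrm{pen}(\mathcal{G}_k) \right),$$

where $h_k$ is the maximum payoff from $\mathcal{G}_k$ and $\mathrm{OPT}_k = \mathrm{OPT}_{\mathcal{G}_k}$.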

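Proof: Using Corollary 7.3.5 and a union bound over the values $\delta_k = \delta/(4k^2)$, we obtain that with
probability at least $1 - \delta$, simultaneously for all $k$ and for all offers $g$ in $\mathcal{G}_k$ such that
$g(S) \ge \frac{8 h_k}{\epsilon^2} \ln\left(\frac{8 k^2 |\mathcal{G}_k|}{\delta}\right) = \mathrm{pen}(\mathcal{G}_k)$, we have $|g(S_1) - g(S_2)| \le \frac{\epsilon}{2} g(S)$. Let $k^*$ be the optimal index,
namely let $k^*$ be the index such that

$$(1 - \epsilon)\, \mathrm{OPT}_{k^*} - 2\, \mathrm{pen}(\mathcal{G}_{k^*}) = \max_k \left( (1 - \epsilon)\, \mathrm{OPT}_k - 2\, \mathrm{pen}(\mathcal{G}_k) \right),$$

and let $k_i$ be the index of the best offer (according to our criterion) over $S_i$, for $i = 1, 2$. By our assumption
that $g_1$ and $g_2$ were chosen by an optimal algorithm, we have

$$g_i(S_i) - \mathrm{pen}(\mathcal{G}_{k_i}) \ge g_{\mathrm{OPT}_{k^*}}(S_i) - \mathrm{pen}(\mathcal{G}_{k^*}), \quad \text{for } i = 1, 2.$$

We will argue next that $g_1(S_2) \ge \frac{1 - \epsilon/2}{1 + \epsilon/2} \left( g_{\mathrm{OPT}_{k^*}}(S_1) - \mathrm{pen}(\mathcal{G}_{k^*}) \right)$. First, if $g_1(S_1) < \mathrm{pen}(\mathcal{G}_{k_1})$, then
the conclusion is clear since we have

$$0 > g_1(S_1) - \mathrm{pen}(\mathcal{G}_{k_1}) \ge g_{\mathrm{OPT}_{k^*}}(S_1) - \mathrm{pen}(\mathcal{G}_{k^*}).$$

If $g_1(S_1) \ge \mathrm{pen}(\mathcal{G}_{k_1})$, then as argued above we have $|g_1(S_1) - g_1(S_2)| \le \frac{\epsilon}{2} g_1(S)$ and so

$$g_1(S_2) \ge \frac{1 - \epsilon/2}{1 + \epsilon/2}\, g_1(S_1) \ge \frac{1 - \epsilon/2}{1 + \epsilon/2} \left( g_{\mathrm{OPT}_{k^*}}(S_1) - \mathrm{pen}(\mathcal{G}_{k^*}) \right).$$

Similarly, we can prove that $g_2(S_1) \ge \frac{1 - \epsilon/2}{1 + \epsilon/2} \left( g_{\mathrm{OPT}_{k^*}}(S_2) - \mathrm{pen}(\mathcal{G}_{k^*}) \right)$. All these together imply
that the profit of the mechanism $\mathrm{RSO\text{-}SRM}_{(\mathcal{G},\mathrm{pen})}$, namely $g_1(S_2) + g_2(S_1)$, is at least

$$\frac{1 - \epsilon/2}{1 + \epsilon/2} \left( g_{\mathrm{OPT}_{k^*}}(S) - 2\, \mathrm{pen}(\mathcal{G}_{k^*}) \right) \ge (1 - \epsilon)\, \mathrm{OPT}_{k^*} - 2\, \mathrm{pen}(\mathcal{G}_{k^*}),$$

as desired. ■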




7.3.3  Improving  the  Bounds 


The results above say, in essence, that if we have enough bidders so that the optimal profit is large
compared to $\frac{h}{\epsilon^2} \log(|\mathcal{G}|)$, then our mechanism will perform nearly as well as the best offer in $\mathcal{G}$. In these
bounds, one should think of $\log(|\mathcal{G}|)$ as a measure of the complexity of the offer class $\mathcal{G}$; for instance,
it can be thought of as the number of bits needed to describe a typical offer in that class. However, in
many cases one can achieve a better bound by adapting techniques developed for analyzing
generalization performance in machine learning theory. In this section, we discuss a number of such methods that
can produce better bounds. These include both analysis techniques (such as using appropriate forms of
covering numbers), where we do not change the mechanism but instead provide a stronger guarantee, and
design techniques (like discretizing), where we modify the mechanism to produce a better bound.

Discretizing 

Notation: Given a class of offers $\mathcal{G}$, define $\mathcal{G}_\alpha$ to be the set of offers induced by rounding all prices down
to the nearest power of $(1 + \alpha)$.

In many cases, we can greatly reduce $|\mathcal{G}|$ without much affecting $\mathrm{OPT}_{\mathcal{G}}$ by performing some type of
discretization. For instance, for auctioning a digital good, there are infinitely many offers induced by all
take-it-or-leave-it prices but only $\log_{1+\alpha} h \approx \frac{1}{\alpha} \ln h$ offers induced by the discretized prices at powers of
$1 + \alpha$. Also, since rounding down the optimal price to the nearest power of $1 + \alpha$ can reduce revenue
for this auction by at most a factor of $1 + \alpha$, the optimal offer in the discretized class must be close, in
terms of total profit, to the optimal offer in the original class. More generally, if we can find a smaller
offer class $\mathcal{G}'$ such that $\mathrm{OPT}_{\mathcal{G}'}$ is guaranteed to be close to $\mathrm{OPT}_{\mathcal{G}}$, then we can instruct our algorithm
$A$ to optimize over $\mathcal{G}'$ instead of $\mathcal{G}$ to get better bounds. We consider the discretization $\mathcal{G}_\alpha$ in our refined
analysis of the digital good auction problem (Section 7.4) and in our consideration of attribute auctions
(Section 7.5). Further, in Section 7.6 we discuss an interesting alternative discretization for item-pricing
in combinatorial auctions.
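Rounding a price down to the nearest power of $1 + \alpha$ can be sketched as below. The function name is hypothetical, and the two correction loops only guard against floating-point error in the logarithm.

```python
import math


def round_down_to_power(price: float, alpha: float) -> float:
    """Largest (1 + alpha)^j with integer j that does not exceed `price`.

    Assumes price >= 1 and alpha > 0, as for prices in [1, h].
    """
    base = 1.0 + alpha
    j = math.floor(math.log(price) / math.log(base))
    while base ** (j + 1) <= price:  # logarithm was rounded slightly low
        j += 1
    while base**j > price:  # logarithm was rounded slightly high
        j -= 1
    return base**j


if __name__ == "__main__":
    for price in (1.0, 3.0, 5.0, 8.0):
        print(price, round_down_to_power(price, 0.5))
```

Since the rounded price is within a factor of $1 + \alpha$ of the original and every bidder who accepted the original price also accepts the rounded one, the revenue drops by at most that factor.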

Counting  Possible  Outputs 

Suppose we can argue that our algorithm $A$, run on a subset of $S$, will only ever output offers from a
restricted set $\mathcal{G}_A \subseteq \mathcal{G}$. For example, for the problem of auctioning a digital good, if $A$ picks the offer
based on the optimal take-it-or-leave-it price over its input then this price must be one of the bids, so
$|\mathcal{G}_A| \le n$. Then, we can simply replace $|\mathcal{G}|$ with $|\mathcal{G}_A|$ (or $|\mathcal{G}_A| + 1$ if the optimal offer is not in $\mathcal{G}_A$) in all
the above arguments. Formally we can say that:

Observation 7.3.7 If algorithm $A$, run on any subset of $S$, only outputs offers from a restricted set $\mathcal{G}_A \subseteq \mathcal{G}$,
then all the bounds in Sections 7.3.1 and 7.3.2 hold with $|\mathcal{G}|$ replaced by $|\mathcal{G}_A| + 1$.

Using  Covering  Numbers 

The main idea of these arguments is the following. Suppose $\mathcal{G}$ has the property that there exists a much
smaller class $\mathcal{G}'$ such that every $g \in \mathcal{G}$ is "close" to some $g' \in \mathcal{G}'$, with respect to the given set of bidders
$S$. Then one can show that if all offers in $\mathcal{G}'$ perform similarly on $S_1$ as they do on $S_2$, then this will
be true for all offers in $\mathcal{G}$ as well. These kinds of arguments are quite often used in machine learning
(see for instance [18, 12, 103, 203]), but the main challenge is to define the right notion of "close" for
our mechanism design setting to get good and meaningful bounds. Specifically, we will consider $L_1$
multiplicative $\gamma$-covers which we define as follows:




Definition 7.3.1 $\mathcal{G}'$ is an $L_1$ multiplicative $\gamma$-cover of $\mathcal{G}$ with respect to $S$ if for every $g \in \mathcal{G}$ there exists
$g' \in \mathcal{G}'$ such that

$$\sum_{i \in S} |g(i) - g'(i)| \le \gamma\, g(S).$$
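For finite classes, the cover condition in this definition can be checked directly. In the sketch below offers are represented as functions from a bid to a profit, and the function names and the sample bids are hypothetical.

```python
from typing import Callable, Sequence

Offer = Callable[[float], float]  # maps a bid to the profit the offer extracts


def fixed_price_offer(price: float) -> Offer:
    """Take-it-or-leave-it offer: the bidder pays `price` iff bid >= price."""
    return lambda bid: price if price <= bid else 0.0


def is_l1_multiplicative_cover(
    cover: Sequence[Offer],
    offers: Sequence[Offer],
    bids: Sequence[float],
    gamma: float,
) -> bool:
    """True iff every g in `offers` has some g' in `cover` with
    sum_i |g(i) - g'(i)| <= gamma * g(S), where S is given by `bids`."""
    for g in offers:
        budget = gamma * sum(g(b) for b in bids)
        if not any(
            sum(abs(g(b) - g_prime(b)) for b in bids) <= budget for g_prime in cover
        ):
            return False
    return True


if __name__ == "__main__":
    bids = [1.5, 2.0, 3.0]  # hypothetical bids
    offers = [fixed_price_offer(p) for p in (1.0, 1.05, 2.0)]
    cover = [fixed_price_offer(p) for p in (1.0, 2.0)]
    print(is_l1_multiplicative_cover(cover, offers, bids, gamma=0.1))
```

In this sample the price 1.05 is covered by the price 1.0 because both are accepted by the same bidders and differ by only 0.05 per bidder, which is small relative to the total profit of the covered offer.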

In the following we present bounds based on $L_1$ multiplicative $\gamma$-covers. We start by proving the
following structural lemma characterizing these $L_1$ covers.

Lemma 7.3.8 If $\sum_{i \in S} |g(i) - g'(i)| \le \gamma g(S)$ and $|g'(S_1) - g'(S_2)| \le \epsilon' \max[g'(S), p]$ then we have

$$|g(S_1) - g(S_2)| \le \epsilon' \max[g'(S), p] + \gamma g(S).$$

This further implies that

$$|g(S_1) - g(S_2)| \le (\gamma + \epsilon'(1 + \gamma)) \max[g(S), p].$$

Proof: We will first prove that $g(S_1) \ge g(S_2) - \epsilon' \max[g'(S), p] - \gamma g(S)$. Note that this clearly implies

$$g(S_1) \ge g(S_2) - (\gamma + \epsilon'(1 + \gamma)) \max[g(S), p],$$

since the first assumption in the lemma implies that $|g(S) - g'(S)| \le \gamma g(S)$. Let us define

$$\Delta_{g_1 g_2}(S) = \sum_{i \in S} \max(g_1(i) - g_2(i), 0)$$

and consider

$$\Delta_{gg'}(S) + \Delta_{g'g}(S) = \sum_{i \in S} |g(i) - g'(i)|.$$

Clearly, for any $S' \subseteq S$ we have $\Delta_{gg'}(S) \ge \Delta_{gg'}(S')$ and likewise $\Delta_{g'g}(S) \ge \Delta_{g'g}(S')$. Also, for
any subset $S' \subseteq S$ we have $g(S') - g'(S') \le \Delta_{gg'}(S)$ and $g'(S') - g(S') \le \Delta_{g'g}(S)$. Now, from
$g'(S_1) \ge g'(S_2) - \epsilon' \max[g'(S), p]$ we obtain that

$$g(S_1) + \Delta_{g'g}(S) \ge g'(S_2) - \epsilon' \max[g'(S), p] \ge g(S_2) - \Delta_{gg'}(S) - \epsilon' \max[g'(S), p].$$

Therefore we have

$$g(S_1) \ge g(S_2) - \left( \Delta_{gg'}(S) + \Delta_{g'g}(S) \right) - \epsilon' \max[g'(S), p],$$

which implies

$$g(S_1) \ge g(S_2) - \epsilon' \max[g'(S), p] - \gamma g(S),$$

as desired. Using the same argument with $S_1$ replaced by $S_2$ yields the lemma. ■

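Using Lemma 7.3.8, we can now get the following bound:

Theorem 7.3.9 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
with probability at least $1 - \delta$, the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ is at least $(1 - \epsilon)\mathrm{OPT}_{\mathcal{G}}/\beta$ so long as

$$\mathrm{OPT}_{\mathcal{G}} \ge \beta \frac{18h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}'|}{\delta}\right),$$

for some $L_1$ multiplicative $\frac{\epsilon}{12}$-cover $\mathcal{G}'$ of $\mathcal{G}$ with respect to $S$.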




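Proof: Let $p = \frac{18h}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}'|}{\delta}\right)$. By Lemma 7.3.4, applying the union bound, we have that with probability
$1 - \delta$, every $g' \in \mathcal{G}'$ satisfies $|g'(S_1) - g'(S_2)| \le \frac{\epsilon}{3} \max[g'(S), p]$. Using Lemma 7.3.8, with $\epsilon'$ set to $\frac{\epsilon}{3}$ and
$\gamma$ set to $\frac{\epsilon}{12}$, we obtain that with probability $1 - \delta$, every $g \in \mathcal{G}$ satisfies $|g(S_1) - g(S_2)| \le \frac{\epsilon}{2} \max[g(S), p]$.
Finally, proceeding as in the proof of Theorem 7.3.1 we obtain the desired result. ■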

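Notice that Theorem 7.3.9 implies that:

Corollary 7.3.10 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
with probability at least $1 - \delta$, the profit of $\mathrm{RSO}_{(\mathcal{G},A)}$ is at least $(1 - \epsilon)\mathrm{OPT}_{\mathcal{G}}/\beta$, so long as $\mathrm{OPT}_{\mathcal{G}} \ge n$
and the number of bidders satisfies

$$n \ge \frac{18 h \beta}{\epsilon^2} \ln\left(\frac{2|\mathcal{G}'|}{\delta}\right)$$

for some $L_1$ multiplicative $\frac{\epsilon}{12}$-cover $\mathcal{G}'$ of $\mathcal{G}$ with respect to $S$.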

We will demonstrate the utility of $L_1$ multiplicative covers in Section 7.4 by showing the existence
of $L_1$ covers of size $o(n)$ for the digital good auction. It is worth noting that a straightforward
application of analogous $\epsilon$-cover results in learning theory [18] (which would require an additive, rather than
multiplicative, gap of $\epsilon$ for every bidder) would add an extra factor of $h$ into our sample-size bounds.


7.4  The  Digital  Good  Auction 

We now consider applying the results in Section 7.3 to the problem of auctioning a digital good to
indistinguishable bidders. In this section we define $\mathcal{G}$ to be the natural class of offers induced by the set of all
take-it-or-leave-it prices (see for instance [124]). Clearly in this case, it is trivial to solve the underlying
optimization problem optimally: given a set of bidders, just output the offer induced by the constant price
that maximizes the price times the number of bidders with bids at least as high as the price. Also, it is
easy to see that this price will be one of the bid values. Thus, applying Observation 7.3.7 with the bound
$|\mathcal{G}_A| = n$ we get an approximately optimal auction with convergence rate $O(h \log n)$.
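Because the optimal constant price is always one of the bids, the optimization problem reduces to a single pass over the sorted bids. The following sketch is one straightforward way to do this; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def optimal_fixed_price(bids: Sequence[float]) -> Tuple[float, float]:
    """Return (price, revenue) maximizing price * #{bids >= price}.

    Only bid values are tried. After sorting in decreasing order, the bid at
    rank r (1-based) is accepted by at least r bidders; for repeated values
    the last copy has the exact count, and it also has the largest product.
    """
    best_price, best_revenue = 0.0, 0.0
    for rank, bid in enumerate(sorted(bids, reverse=True), start=1):
        revenue = bid * rank
        if revenue > best_revenue:
            best_price, best_revenue = bid, revenue
    return best_price, best_revenue


if __name__ == "__main__":
    print(optimal_fixed_price([1.0, 2.0, 2.0, 5.0]))
```

The running time is dominated by the sort, so this takes $O(n \log n)$ time for $n$ bids.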

We can obtain better results using $L_1$ multiplicative-cover arguments and Theorem 7.3.9 as follows.
Let $b_1, \ldots, b_n$ be the bids of the $n$ bidders sorted from highest to lowest. Define $\mathcal{G}'$ as the offer class
induced by $\{b_i : i = \lfloor (1+\gamma)^j \rfloor \text{ for some } j \in \mathbb{Z}\} \cup \{(1+\gamma)^i : i \in \{1, \ldots, \log_{1+\gamma} h\}\}$. Consider
$g \in \mathcal{G}$ and find the $g' \in \mathcal{G}'$ that offers the largest price less than the offer price of $g$. Notice first that
all the winners in $S$ on $g$ also win in $g'$. Second, the offer price of $g'$ is within a factor of $1+\gamma$ of the
offer price of $g$. Third, $g'$ has at most a factor of $1+\gamma$ more winners than $g$. The first two facts above
imply that $\Delta_{gg'}(S) \le \gamma g(S)$. The third fact implies that $\Delta_{g'g}(S) \le \gamma g(S)$. Thus, $\Delta_{gg'}(S) + \Delta_{g'g}(S) \le 2\gamma g(S)$ and
therefore, $\mathcal{G}'$ is a $2\gamma$-cover of $\mathcal{G}$ (see the proof of Lemma 7.3.8 for definitions of $\Delta_{gg'}$ and $\Delta_{g'g}$). Since
$|\mathcal{G}'|$ is $O(\log hn)$, the additive loss of RSO$_{(\mathcal{G}',A)}$ is $O(h \log\log hn)$.
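The price set that induces this cover is easy to enumerate. The sketch below is illustrative only: it takes the bids at ranks $\lfloor (1+\gamma)^j \rfloor$ for $j \ge 0$ (negative $j$ gives rank 0, which names no bidder) together with the powers of $(1+\gamma)$ up to $h$. The function name `cover_prices` is hypothetical.

```python
import math


def cover_prices(bids, h, gamma):
    """Bids at ranks floor((1+gamma)^j), plus powers of (1+gamma) up to h."""
    ranked = sorted(bids, reverse=True)  # b_1 >= b_2 >= ... >= b_n
    n = len(ranked)
    prices = set()
    j = 0
    while math.floor((1 + gamma) ** j) <= n:
        prices.add(ranked[math.floor((1 + gamma) ** j) - 1])  # ranks are 1-indexed
        j += 1
    power = 1 + gamma
    while power <= h:
        prices.add(power)
        power *= 1 + gamma
    return sorted(prices)
```

With 100 bidders and $\gamma = 0.5$ the set has a few dozen prices at most, compared with the 100 candidate prices given by the bid values themselves.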

We can also apply the discretization technique by defining $\mathcal{G}_\alpha$ to be the set of offers induced by the
set of all constant-price functions whose price $v \in [1, h]$ is a power of $(1+\alpha)$ and $\alpha = \frac{\epsilon}{2}$. Clearly, if we
can get revenue at least $(1 - \frac{\epsilon}{2})$ times the optimal in this class, we will be within $(1-\epsilon)$ of the optimal
fixed price overall. For example, Corollary 7.3.2 ($A$ can trivially find the best offer in $\mathcal{G}_\alpha$ by simply trying
all of them) shows that with probability $1-\delta$ we get at least $1-\epsilon$ times the revenue of the optimal
take-it-or-leave-it  offer  so  long  as  the  number  of  bidders  n  is  at  least  ^  ln(^^^)  =  0(/i  log  log  h). 
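The effect of discretizing prices can be checked numerically. This is a minimal sketch under the assumption that all bids lie in $[1, h]$: rounding the optimal price down to the nearest power of $(1+\alpha)$ keeps every winner and lowers the price by at most a factor of $(1+\alpha)$. The helper names are hypothetical.

```python
def revenue(price, bids):
    """Price times the number of bidders whose bid is at least the price."""
    return price * sum(1 for b in bids if b >= price)


def price_grid(h, alpha):
    """Powers of (1 + alpha) lying in [1, h]."""
    grid, p = [], 1.0
    while p <= h:
        grid.append(p)
        p *= 1 + alpha
    return grid


def best_grid_price(bids, h, alpha):
    """Try every grid price and keep the best (price, revenue) pair."""
    candidates = ((p, revenue(p, bids)) for p in price_grid(h, alpha))
    return max(candidates, key=lambda pair: pair[1])
```

The best grid revenue never exceeds the unrestricted optimum and, for bids in $[1, h]$, is at least that optimum divided by $(1+\alpha)$.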

It is interesting to contrast these results with that of [121], which showed that RSO over the set of constant-price functions is
near 6-competitive with the promise that $n \gg h$.




7.4.1  Data  Dependent  Bounds 

We can use the high level idea of our structural risk minimization reduction in order to get a better data
dependent bound for the digital good auction. In particular, we can replace the “$h$” term in the additive
loss with the actual sale price used by the optimal take-it-or-leave-it offer (in fact, even better, the lowest
sales price needed to generate near-optimal revenue), yielding a much better bound when most of the
profit to be made is from the low bids. The idea is that rather than penalizing the “complexity” of the offer
in the usual sense, we instead penalize the use of higher prices.

Let $q_i = (1+\alpha)^i$ and offer $g_i$ be the take-it-or-leave-it price of $q_i$. Define $\mathcal{G} = \{g_1\}, \{g_2\}, \ldots$ and
consider  the  auction  RSO-SRMg  with  pen({g'j})  specified  from  Section  7.3.2  to  be  ^  In  The 

following is a corollary of Theorem 7.3.6.

Corollary  7.4.1  For  any  given  value  ofn,  e,  and  6,  with  probability  1  —  5,  the  revenue  ofRSO-SRMf^g  ^^^'^ 
is  at  least  [(1  -  e)gi{S)  -  2pen({pj})],  w/iere  pen({5rj})  =  ^  In 

In other words, if the optimal take-it-or-leave-it offer has a sale price of $p$, then RSO-SRM$_{\mathcal{G}}$ has
convergence rate bounded by $O(p \log\log h)$ instead of $O(h \log\log h)$ as provided by our generic analysis
of RSO$_{(\mathcal{G},A)}$.


7.4.2  A  Special  Purpose  Analysis  for  the  Digital  Good  Auction 


In  this  section  we  present  a  refined  data  independent  analysis  for  the  digital  good  auction.  Specifically,  we 
can  show  for  an  optimal  algorithm  A,  that: 

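Theorem 7.4.2 For $\delta < \frac{1}{2}$, with probability $1-\delta$, RSO$_{(\mathcal{G}_\alpha,A)}$ obtains profit at least

$$\mathrm{OPT}_{\mathcal{G}_\alpha} - 8\sqrt{h\,\mathrm{OPT}_{\mathcal{G}_\alpha}\log\frac{1}{\alpha\delta}}.$$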


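Corollary 7.4.3 For $\delta < \frac{1}{2}$ and $\alpha = \frac{\epsilon}{2}$, so long as $\mathrm{OPT}_{\mathcal{G}_\alpha} \ge \left(\frac{16}{\epsilon}\right)^2 h \log\frac{2}{\epsilon\delta}$, then with probability at
least $1-\delta$, the profit of RSO$_{(\mathcal{G}_\alpha,A)}$ is at least $(1-\epsilon)\,\mathrm{OPT}_{\mathcal{G}}$.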

The above corollary improves over our basic discretization results using Theorem 7.3.1 by an $O(\log\log h)$
factor in the convergence rate.

To prove Theorem 7.4.2, let us introduce some notation. For the offer $g_v$ induced by the take-it-or-leave-it
offer of price $v$, let $n_v$ denote the number of winners (bidders whose value is at least $v$), and let
$r_v = v \cdot n_v$ denote the profit of $g_v$ on $S$. Denote by $\hat r_v$ the observed profit of $g_v$ on $S_1$ (and so $\hat r_v = v \cdot \hat n_v$,
where $\hat n_v$ is the number of winners in $S_1$ for $g_v$). So, we have $\mathbf{E}[\hat r_v] = \frac{r_v}{2}$. We now begin with the
following lemma.

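Lemma 7.4.4 Let $\epsilon < 1$ and $\delta < \frac{1}{2}$. With probability at least $1-\delta$ we have that, for every $g_v \in \mathcal{G}_\alpha$ the
observed profit on $S_1$ satisfies:

$$\left|\hat r_v - \frac{r_v}{2}\right| \le \max\left(\frac{h \log\frac{1}{\alpha\delta}}{\epsilon},\ \epsilon r_v\right).$$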

Proof: First, for a given price $v$ let $a_{n,v}$ be $|\hat n_v - \frac{n_v}{2}|$. To prove our lemma we will use the consequence
of the Chernoff bound we present in Appendix A.3, Theorem A.3.2. For any $v$ and $j \ge 1$ we consider

,  (l-l-a)JIog  (  A) 

n  = - 3 — and  so  we  get 


Pr 


^n,v  >  emax 


(l  +  a)^'log(^)^ 




This  further  implies  that  we  have  an,v  >  e  max  (  n 
Therefore  for  v  =  ,,  x,-  we  have 


,  j  with  probability  at  most 


Pr 


r„  - 


Tv 


>  max  \  < 


and  so  the  probability  that  there  exists  a  gv  ^  Ga  such  that  —  ^|  >  max(^,er„)  is  at  most 
2^Aa6)^G+o‘y  <  2^-,  <  6.  This  implies  that  with  high  probability,  at  least  1  —  <5,  we 


have that simultaneously, for every $g_v \in \mathcal{G}_\alpha$ the observed revenue on $S_1$ satisfies:


r„  - 


as  desired.  ■ 

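Proof of Theorem 7.4.2: Assume now that it is the case that for every $g_v \in \mathcal{G}_\alpha$ we have

$$\left|\hat r_v - \frac{r_v}{2}\right| \le \max\left(\frac{H}{\epsilon},\ \epsilon r_v\right),$$

where $H = h \log\frac{1}{\alpha\delta}$. Let $v^*$ be the optimal price level among prices in $\mathcal{G}_\alpha$, and let $\tilde v$ be the price that
looks best on $S_1$. Obviously, our gain on $S_2$ is $r_{\tilde v} - \hat r_{\tilde v}$. We have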

K  H  l-2e  H 

Tv*  >  ^ - evv*  =  — - - , 

,  ^  r^*  H  ra*  H 

Tv*  >  ,  and  <  —  H - h  er^*  <  ^  H - h  er^* , 

and  therefore  r^*  —  r^*  >  r^*  —  —  —  er^* ,  which  finally  implies  fhaf 


,1  \  H 

r V*  r^*  A.  V*  (  2  1  ^  * 


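This implies that with probability at least $1-\delta$ our gain on $S_2$ is at least $r_{v^*}\left(\frac{1}{2} - 2\epsilon\right) - \frac{2H}{\epsilon}$, and similarly
our gain on $S_1$ is at least $r_{v^*}\left(\frac{1}{2} - 2\epsilon\right) - \frac{2H}{\epsilon}$. Therefore, with probability $1-\delta$, our revenue
is at least

$$\mathrm{OPT}_{\mathcal{G}_\alpha}(1 - 4\epsilon) - \frac{4H}{\epsilon}.$$

Optimizing the bound we set $\epsilon = \sqrt{H / \mathrm{OPT}_{\mathcal{G}_\alpha}}$ and get a revenue of at least

$$\mathrm{OPT}_{\mathcal{G}_\alpha} - 8\sqrt{H\,\mathrm{OPT}_{\mathcal{G}_\alpha}},$$

which completes the proof. ■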
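The choice of $\epsilon$ in the last step can be checked directly. The short derivation below takes the bound in the form $\mathrm{OPT}(1-4\epsilon) - 4H/\epsilon$ and minimizes the total loss; the shorthand OPT for $\mathrm{OPT}_{\mathcal{G}_\alpha}$ and the loss function $f$ are notation introduced only for this sketch.

```latex
\begin{align*}
f(\epsilon) &= 4\epsilon\,\mathrm{OPT} + \frac{4H}{\epsilon}, \\
f'(\epsilon) &= 4\,\mathrm{OPT} - \frac{4H}{\epsilon^2} = 0
  \;\Longrightarrow\; \epsilon = \sqrt{H/\mathrm{OPT}}, \\
f\bigl(\sqrt{H/\mathrm{OPT}}\bigr) &= 4\sqrt{H\,\mathrm{OPT}} + 4\sqrt{H\,\mathrm{OPT}}
  = 8\sqrt{H\,\mathrm{OPT}}.
\end{align*}
```

Since $f$ is convex for $\epsilon > 0$, this stationary point is the minimizer, which accounts for the constant 8 in the theorem.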




7.5  Attribute  Auctions 


We now consider applying our general bounds (Section 7.3) to attribute auctions. For attribute auctions an
offer is a function from the publicly observable attribute of an agent to a take-it-or-leave-it price. As such,
we identify such an offer with its pricing function. We begin by instantiating the results in Section 7.3
for market pricing auctions, in which we consider pricing functions that partition the attribute space into
market segments and offer a fixed price in each. We show how one can use standard combinatorial
dimensions in learning theory, e.g. the Vapnik-Chervonenkis (VC) dimension [18, 69, 103, 149, 203], in
order to bound the complexity of these classes of offers. We then give an analysis for very general offer
classes induced by general pricing functions over the attribute space that uses the notion of covers defined
in Section 7.3.3.


7.5.1  Market  Pricing 


For attribute auctions, one natural class of pricing functions are those that segment bidders into markets
in some simple way and then offer a single sale price in each market segment. For example, suppose we
define $\mathcal{P}_k$ to be the set of functions that choose $k$ bidders $b_1, \ldots, b_k$; use these as cluster centers to partition
$S$ into $k$ markets based on distance to the nearest center in attribute space; and then offer a single price in
each market. In that case, if we discretize prices to powers of $(1+\epsilon)$, then clearly the number of functions
in the offer class $\mathcal{G}_k$ induced by the pricing class $\mathcal{P}_k$ is at most $n^k (\log_{1+\epsilon} h)^k$, so Corollary 7.3.2 implies
that so long as $n \ge \frac{18h}{\epsilon^2}\left[\ln\frac{2}{\delta} + k \ln n + k \ln(\log_{1+\epsilon} h)\right]$ and assuming we can solve the optimization
problem, then with probability at least $1-\delta$, we can get profit at least $(1-\epsilon)\,\mathrm{OPT}_{\mathcal{G}_k}$.
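The market-pricing class in this example is small enough to search exhaustively on toy inputs. The sketch below is a brute-force illustration, not an efficient algorithm and not the thesis's method: it tries every choice of $k$ bidders as cluster centers, assigns each bidder to the nearest center, and prices each market independently with the best power of $(1+\epsilon)$. Bidders are assumed to be `(attribute_tuple, bid)` pairs; all names are hypothetical.

```python
from itertools import combinations


def best_grid_revenue(bids, h, eps):
    """Best revenue from one price among powers of (1 + eps) in [1, h]."""
    best, p = 0.0, 1.0
    while p <= h:
        best = max(best, p * sum(1 for b in bids if b >= p))
        p *= 1 + eps
    return best


def segment(bidders, centers):
    """Partition bids into markets by nearest center (squared Euclidean distance)."""
    markets = [[] for _ in centers]
    for x, bid in bidders:
        dists = [sum((a - c) ** 2 for a, c in zip(x, center)) for center in centers]
        markets[dists.index(min(dists))].append(bid)
    return markets


def best_market_pricing(bidders, k, h, eps):
    """Brute force over all choices of k bidders as cluster centers."""
    best = 0.0
    for chosen in combinations(bidders, k):
        centers = [x for x, _ in chosen]
        total = sum(best_grid_revenue(m, h, eps) for m in segment(bidders, centers))
        best = max(best, total)
    return best
```

Because revenue is separable across markets, pricing each market on its own is equivalent to searching over all combinations of per-market prices.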

We can also consider more general ways of defining markets. Let $C$ be any class of subsets of $X$,
which we will call feasible markets. For $k$ a positive integer, we consider $F_{k+1}(C)$ to be the set of all
pricing functions of the following form: pick $k$ disjoint subsets $X_1, \ldots, X_k \subseteq X$ from $C$, and $k+1$ prices
$p_0, \ldots, p_k$ discretized to powers of $1+\epsilon$. Assign price $p_i$ to bidders in $X_i$, and price $p_0$ to bidders not in
any of $X_1, \ldots, X_k$. For example, if $X = \mathbb{R}^d$ a natural $C$ might be the set of axis-parallel rectangles in $\mathbb{R}^d$.
The specific case of $d = 1$ was studied in [60]. One can envision more complex partitions, using the
membership of a bidder in $X_i$ as a basic predicate, and constructing any function over it (e.g., a decision
list).

We can apply the results in Section 7.3 by using the machinery of VC-dimension to count the number
of distinct such functions over any given set of bidders $S$. In particular, let $D = \mathrm{VCdim}(C)$ be the
VC-dimension of $C$ and assume $D < \infty$. Define $C[S]$ to be the number of distinct subsets of $S$ induced by
$C$. Then, from Sauer’s Lemma $C[S] \le \left(\frac{en}{D}\right)^D$, and therefore the number of different pricing functions in
$F_k(C)$ over $S$ is at most $(\log_{1+\epsilon} h)^k \left(\frac{en}{D}\right)^{kD}$. Thus applying Corollary 7.3.2 here we get:

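Corollary 7.5.1 Given a $\beta$-approximation algorithm $A$ for optimizing over the offer class $\mathcal{G}_k$ induced by
the class of pricing functions $F_k(C)$, then so long as $\mathrm{OPT}_{\mathcal{G}_k} \ge n$ and the number of bidders $n$ satisfies

$$n \ge \frac{18h\beta}{\epsilon^2}\left[\ln\frac{2}{\delta} + k \ln\left(\frac{1}{\epsilon}\ln h\right) + kD \ln\left(\frac{ne}{D}\right)\right],$$

then with probability at least $1-\delta$, the profit of RSO$_{(\mathcal{G}_k,A)}$ is at least $(1-\epsilon)\,\mathrm{OPT}_{\mathcal{G}_k}/\beta$.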

The  above  lemma  has  “n”  on  both  sides  of  the  inequality.  Simple  algebra  yields: 

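Corollary 7.5.2 Given a $\beta$-approximation algorithm $A$ for optimizing over the offer class $\mathcal{G}_k$ induced by
the class of pricing functions $F_k(C)$, then so long as $\mathrm{OPT}_{\mathcal{G}_k} \ge n$ and the number of bidders $n$ satisfies

$$n \ge \frac{36h\beta}{\epsilon^2}\left[\ln\frac{2}{\delta} + k \ln\left(\frac{1}{\epsilon}\ln h\right) + kD \ln\left(\frac{36kh\beta}{\epsilon^2}\right)\right],$$

then with probability at least $1-\delta$, the profit of RSO$_{(\mathcal{G}_k,A)}$ is at least $(1-\epsilon)\,\mathrm{OPT}_{\mathcal{G}_k}/\beta$.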

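Proof: Since $\ln a \le ab - \ln b - 1$ for all $a, b > 0$, taking $a = n$ and $b = \frac{\epsilon^2}{36kDh\beta}$ we obtain

$$kD \ln\left(\frac{ne}{D}\right) \le \frac{\epsilon^2 n}{36h\beta} + kD \ln\left(\frac{36kh\beta}{\epsilon^2}\right).$$

Therefore, it suffices to have:

$$n \ge \frac{n}{2} + \frac{18h\beta}{\epsilon^2}\left[\ln\frac{2}{\delta} + k \ln\left(\frac{1}{\epsilon}\ln h\right) + kD \ln\left(\frac{36kh\beta}{\epsilon^2}\right)\right],$$

so

$$n \ge \frac{36h\beta}{\epsilon^2}\left[\ln\frac{2}{\delta} + k \ln\left(\frac{1}{\epsilon}\ln h\right) + kD \ln\left(\frac{36kh\beta}{\epsilon^2}\right)\right]$$

suffices.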


For certain classes $C$ we can get better bounds. In the following, denote by $C_k$ the concept class
of unions of at most $k$ sets from $C$, and let $L$ be $\lceil \log_{1+\epsilon} h \rceil$. If $C$ is the class of intervals on the line,
then the VC-dimension of $C_k$ is $2k$, and so the number of different pricing functions in $F_k(C)$ over $S$
is at most $L^k \left(\frac{en}{2k}\right)^{2k}$; also, if $C$ is the class of all axis parallel rectangles in $d$ dimensions, then the
VC-dimension of $C_k$ is $O(kd)$ [107]. In these cases we can remove the $\log k$ term in our bounds, which is nice
because it means we can interpret our results (e.g., Corollary 7.5.2) as charging OPT a penalty for each
market it creates. However, we do not know how to remove this $\log k$ term in general, since in general the
VC-dimension of $C_k$ can be as large as $2Dk \log(2Dk)$ (see [57, 102]).
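The counting behind these bounds can be made concrete for intervals on the line, which have VC-dimension 2. The sketch below enumerates every subset of a point set that a single interval can cut out and compares the count with the binomial sum in Sauer's Lemma and with the looser $(en/D)^D$ form. The function names are hypothetical.

```python
from math import comb, e


def interval_induced_subsets(points):
    """All distinct subsets of `points` cut out by a single interval on the line."""
    pts = sorted(set(points))
    subsets = {frozenset()}  # an interval that misses every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))
    return subsets


def sauer_sum(n, d):
    """Sum of binomial coefficients C(n, 0..d), the exact quantity in Sauer's Lemma."""
    return sum(comb(n, i) for i in range(d + 1))
```

For $n$ distinct points the count is $n(n+1)/2 + 1$, which meets the Sauer sum with $D = 2$ exactly and sits well below $(en/2)^2$.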

Corollary 7.5.2 gives a guarantee on the revenue of RSO$_{(\mathcal{G}_k,A)}$ so long as we have enough bidders. In
the following, for $k \ge 0$ let $\mathrm{OPT}_k = \mathrm{OPT}_{\mathcal{G}_k}$. We can also use Corollaries 7.3.5 and 7.5.2 to show a
bound that holds for all $n$, but with an additive loss term.

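Theorem 7.5.3 For any given value of $n$, $k$, $\epsilon$, and $\delta$, with probability at least $1-\delta$, the revenue of
RSO$_{(\mathcal{G}_k,A)}$ is at least

$$\frac{1}{\beta}\left[(1-\epsilon)\,\mathrm{OPT}_k - h \cdot r_F(k, D, h, \epsilon, \delta)\right],$$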

whererF{k,D,h,e,5)  =  O 

Proof:  For  simplicity,  we  show  the  proof  for  /?  =  1,  the  general  case  is  similar.  We  prove  the  bound  with 
the  “(1  —  e)”  term  replaced  by  the  term  min  1  —  ,  which  then  implies  our  desired  result  by 

simply  using  e'  =  | .  If 


n  > 


36/i 


+  kin 


-|-  kD\n 


36kh\ 


then  the  desired  statement  follows  directly  from  Corollary  7.5.2.  Otherwise,  consider  first  the  case  when 
we  have 


OPTfe  > 


4h 

e'2(l  -  P) 


ln0^  +kln(^^lnh^  +  kD  In 


Let $g_i$ be the optimal offer in $\mathcal{G}_k$ over $S_i$, for $i = 1, 2$, and let $g_{\mathrm{OPT}}$ be the optimal offer in $\mathcal{G}_k$ over $S$ (and
so $g_i(S_i) \ge g_{\mathrm{OPT}}(S_i)$). From Corollary 7.3.5, we have


So, 


50PT(5'i)  > 


2h 


ln0^  +kln(^^lnh'^  +kDln(J^'^ 


for  i  =  1,  2. 


QiiSi)  > 


2h 


./2 


+  kin 


+  kD  In 




Using  again  Corollary  7.3.5,  we  obtain  gi{Sj)  >  \^gi{Si)  for  j  /  i,  which  then  implies  the  desired 
result.  To  complete  the  proof  notice  that  if  both 


OPTfc  < 


Ah 

e'2(l  -e') 


In  +  fcln  In  +  kD  In 


and 

ln0^  +A:ln(^^ln/r^  +A;Z71n(^^^  , 
then  we  easily  get  the  desired  statement.  ■ 

Finally, as in Theorem 7.3.6 we can extend our results to use structural risk minimization, where we
want the algorithm to optimize over $k$, by viewing the additive loss term, $h \cdot r_F(\cdot)$, as a penalty function.

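Theorem 7.5.4 Let $\mathcal{G}$ be the sequence $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_n$ of offer classes induced by the sequence of classes
of pricing functions $F_1(C), F_2(C), \ldots, F_n(C)$. Then for any value of $n$, $\epsilon$ and $\delta$, with probability $1-\delta$
the revenue of RSO-SRM$_{\mathcal{G}}$ is at least

$$\max_k \left((1-\epsilon)\,\mathrm{OPT}_k - h \cdot r_F(k, D, h, \epsilon, \delta)\right),$$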


Ah 
n  <  ^ 
-  e'2 


where  pen{Fk{C))  =  §  •  rF{k,D,h,e,5)  =  O  (^In  (^)). 

To illustrate the tightness of Theorem 7.5.3, notice that even for the special case of pricing using
interval functions (the case of $d = 1$ studied in [60]), the following lower bound holds.

Theorem  7.5.5  Let  X  =  TZ  and  let  be  the  class  of  k  intervals  over  X.  Then  there  is  no  incentive 
compatible mechanism whose expected revenue is at least $\frac{3}{4}\,\mathrm{OPT}_k - o(kh)$.

That  is,  an  additive  loss  linear  in  kh  is  necessary  in  order  to  achieve  a  multiplicative  ratio  of  at  least  3/4. 
Proof:  Consider  ^  bidders  with  distinct  attributes  (for  instance,  say  bidder  i  has  attribute  i),  each  of 
whom  independently  has  a  ^  probability  of  having  valuation  h  and  al  —  probability  of  having  valuation 
1.  Then,  any  incentive-compatible  mechanism  has  expected  profit  at  most  ^  because  for  any  given  bidder 
and  any  given  proposed  price,  the  expected  profit  (over  randomization  in  the  bidder’s  valuation)  is  at  most 
1.  However,  there  is  at  least  a  50%  chance  we  will  have  at  least  |  bidders  of  valuation  h,  and  in  that 
case  OPTfc  can  give  |  —  1  of  those  bidders  a  price  of  h  and  the  rest  a  price  of  1  for  an  expected  profit 
of  (I  —  l)  /i  +  |  +  =  —  /i  —  1  +  1.  On  the  other  hand  even  if  that  does  not  occur,  we 

always  have  OPT^.  >  So,  the  expected  profit  of  OPT^  is  at  least  3^  “  t  “  +  Thus,  the  profit  of  the 
incentive-compatible  mechanism  is  at  most  |  OPT^.  —  f|  +  o{kh).  ■ 

We note that a similar lower bound holds for most base classes. Also for the case of intervals on the
line, both our auction and the auction in [60] match this lower bound up to constant factors.


7.5.2  General  Pricing  Functions  over  the  Attribute  Space 

In this section we generalize the results in Section 7.5.1 in two ways: we consider general classes of
pricing functions (not just piecewise-constant functions defined over markets), and we remove the need to
discretize by instead using the covering arguments discussed in Section 7.3.3. This allows us to consider
offers based on linear or quadratic functions of the attributes, or perhaps functions that divide the attribute
space into markets and use pricing functions that are linear in the attributes (rather than constant) in each
market. The key point of this section is that we can bound the size of the $L_1$ multiplicative cover in an
attribute auction in terms of natural quantities.




Assume in the following that $X \subseteq \mathbb{R}^d$, let $\mathcal{P}$ be a fixed class of pricing functions over the attribute
space $X$ and let $\mathcal{G}$ be the induced class of offers. Let $\mathcal{P}_d$ be the class of decision surfaces (in $\mathbb{R}^{d+1}$)
induced by $\mathcal{P}$: that is, to each $q \in \mathcal{P}$ we associate the set of all $(x, v) \in X \times [1, h]$ such that $q(x) \le v$.
Also, let us denote by $D$ the VC-dimension of class $\mathcal{P}_d$. We can then show that:

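Theorem 7.5.6 Given the offer class $\mathcal{G}$ and a $\beta$-approximation algorithm $A$ for optimizing over $\mathcal{G}$, then
so long as $\mathrm{OPT}_{\mathcal{G}} \ge n$ and the number of bidders $n$ satisfies

$$n \ge \frac{154h\beta}{\epsilon^2}\left[\ln\frac{2}{\delta} + D \ln\left(\frac{154h\beta}{\epsilon^2}\left(\frac{12}{\epsilon}\ln h + 1\right)\right)\right],$$

then with probability at least $1-\delta$, the profit of RSO$_{(\mathcal{G},A)}$ is at least $(1-\epsilon)\,\mathrm{OPT}_{\mathcal{G}}/\beta$.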

The key to the proof is to exhibit an $L_1$ multiplicative cover of $\mathcal{G}$ whose size is exponential in $D$ only, and
then to apply Corollary 7.3.10.

Proof: Let $\alpha = \frac{\epsilon}{12}$. For each bidder $(x, v)$ we conceptually introduce $O(\frac{1}{\alpha}\ln h)$ “phantom bidders”
having the same attribute value $x$ and bid values $1, (1+\alpha), (1+\alpha)^2, \ldots, h$. Let $S^*$ be the set $S$ together
with the set of all phantom bidders; let $n^* = |S^*|$. Let Split be the set of possible splittings of $S^*$ with
surfaces from $\mathcal{P}_d$. We clearly have $|\mathrm{Split}| \le \mathcal{P}_d[S^*]$. For each element $s \in \mathrm{Split}$ consider a representative
function in $\mathcal{P}$ that induces splitting $s$ in terms of its winning bidders, and let $\mathrm{Split}_{\mathcal{P}}$ be the set of these
representative functions. Let $\mathcal{G}'$ be the offer class induced by the pricing class $\mathrm{Split}_{\mathcal{P}}$. Notice that $\mathcal{G}'$ is
actually an $L_1$ multiplicative $\alpha$-cover for $\mathcal{G}$ with respect to $S$, since for every offer in $\mathcal{G}$ there is an offer in
$\mathcal{G}'$ that extracts nearly the same profit from every bidder; i.e., for every offer $g \in \mathcal{G}$ there exists $g' \in \mathcal{G}'$
such that for every $(x, v) \in S$, we have both

$$g'((x, v)) \le (1+\alpha)\, g((x, v)) \quad \text{and} \quad g((x, v)) \le (1+\alpha)\, g'((x, v)).$$

From Sauer’s lemma we know $|\mathrm{Split}_{\mathcal{P}}| \le \left(\frac{e n^*}{D}\right)^D$, and applying Corollary 7.3.10, we finally get the
desired statement by using simple algebra as in Corollary 7.5.2. ■
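The phantom-bidder construction used in this proof is mechanical and can be sketched as follows. This is a minimal illustration under the assumption that bidders are `(attribute, bid)` pairs and that the bid levels run through powers of $(1+\alpha)$ with $h$ appended as the top level; the function names are hypothetical.

```python
def bid_levels(h, alpha):
    """Bid values 1, (1+alpha), (1+alpha)^2, ..., with h as the final level."""
    levels, b = [], 1.0
    while b < h:
        levels.append(b)
        b *= 1 + alpha
    levels.append(float(h))
    return levels


def add_phantom_bidders(bidders, h, alpha):
    """Return S*: the real bidders plus one phantom per (attribute, bid level)."""
    levels = bid_levels(h, alpha)
    s_star = list(bidders)
    for x, _bid in bidders:
        s_star.extend((x, level) for level in levels)
    return s_star
```

Two pricing functions that split $S^*$ identically must price each real bidder's attribute between the same pair of adjacent levels, which is why their profits from that bidder differ by at most a factor of $(1+\alpha)$.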

The above theorem is the analog of Corollary 7.3.2. Using it and Theorem 7.3.9, it is easy to derive a
bound that holds for all $n$ (i.e., the analog of Theorem 7.5.3). One can further easily extend these results
to get bounds for the corresponding SRM auction (as done in Theorem 7.5.4).


7.5.3  Algorithms  for  Optimal  Pricing  Functions 

There has been relatively little work on the algorithmic question of computing optimal pricing functions in
general attribute spaces. However, for single-dimensional attributes and piecewise-constant pricing
functions, [60] discusses an optimal polynomial time dynamic program. For single-dimensional attributes and
monotone pricing functions, [9] gives a polynomial time dynamic program. The problem of computing the
optimal linear pricing function over $m$-dimensional attributes generalizes the problem of item-pricing
($m$ distinct items) for single-minded combinatorial consumers (see Section 7.6.4) that has been shown to
be hard to approximate to better than a $\log^{\delta}(m)$ factor for some $\delta > 0$ [101].


7.6  Combinatorial  Auctions 

Combinatorial auctions have received much attention in recent years because of the difficulty of merging
the algorithmic issue of computing an optimal outcome with the game-theoretic issue of incentive
compatibility. To date, the focus primarily has been on the problem of optimizing social welfare: partitioning




a limited supply of items among bidders to maximize the sum of their valuations. We consider instead the
goal of profit maximization for the seller in the case that the items for sale are available in unlimited
supply.⁵ We consider the general version of the combinatorial auction problem as well as the special cases of
unit-demand bidders (each bidder desires only singleton bundles) and single-minded bidders (each bidder
has a single desired bundle).

It is interesting to restrict our attention to the case of item-pricing, where the auctioneer intuitively is
attempting to set a price for each of the distinct items and bidders then choose their favorite bundle given
these prices. Item-pricing is without loss of generality for the unit-demand case, and general
bundle-pricing can be realized with an auction with $m' = 2^m$ “items”, one for each possible bundle of the
original $m$ items.⁶

First  notice  that  if  the  set  of  allowable  item  pricings  are  constrained  to  be  integral,  Qz,  then  clearly 
there  are  at  most  \Q'z\  =  {h-\- 1)™  possible  item  pricings.  By  Corollary  7.3.2  we  get  that  O  (^)  bidders 
are  sufficient  to  achieve  profit  close  to  OPTg^-  Generally  it  is  possible  to  do  much  better  if  non-integral 
item-pricings  are  allowed,  i.e.,  OPTp(S')  OPTg^C'S').  In  these  settings  we  can  still  get  good  bounds 
following the guidelines established in Section 7.3.3, by either considering an offer class $\mathcal{G}'$ induced
by discretization (see Section 7.6.1), or from counting possible outcomes in $\mathcal{G}_A$ (see Section 7.6.2). A
summary of our results is given in Table 7.1.


Table  7.1:  Size  of  offer  classes  for  combinatorial  auctions. 


general 

unif-demand 

single-minded 

l^'l 

o(iogr+.2T) 

o(iogr+.2  7) 

O(log-^rm) 

I^aI 

n™(m+  1)^”^ 

(re  -|-  m)™' 

We can apply Theorem 7.3.1 and Corollary 7.3.2 to the sizes of the offer classes in Table 7.1 to get
bounds  on  the  profit  of  random  sampling  auctions  for  combinatorial  item  pricing.  In  particular,  using 
Corollary  7.3.2  we  get  that  O  bidders  are  sufficient  to  achieve  revenue  close  to  the  optimum  item¬ 

pricing  in  the  general  case,  and  O  (^)  bidders  are  sufficient  for  the  unit-demand  case.  Also,  by  using 
Theorem 7.3.1 instead of Corollary 7.3.2 we can replace the condition on the number of bidders with a
condition on $\mathrm{OPT}_{\mathcal{G}}$, which gives a factor of $m$ improvement on the bound given by [119].

As before we let $h = \max_{g \in \mathcal{G},\, i \in S} g(i)$. In particular, this implies that $\mathrm{OPT}_{\mathcal{G}} \ge h$ which will be
important later in this section.

7.6.1  Bounds  via  Discretization 

As shown in Section 7.3.3, we can obtain good bounds if we are willing to optimize over a set $\mathcal{G}'$ of
offers induced by a small set of discretized prices satisfying that $\mathrm{OPT}_{\mathcal{G}'}$ is close to $\mathrm{OPT}_{\mathcal{G}}$. Prior to
this  work,  [131|  shows  how  to  construct  discretized  classes  Q’  with  OPTp/  >  OPTp  and  size 
0{m^  logi+e  7 )  for  the  unit-demand  case  and  size  0(log^£  ^ )  for  the  single-minded  case.  Nisan  1 178 1 
gives the basic argument necessary to generalize these results to obtain the result in Theorem 7.6.1 which
applies to combinatorial auctions in general. We note in passing that Theorem 7.6.1 allows for
generalization and improvement of the computational results of [131]. The discretization results we obtain are

⁵ Other work focusing on profit maximization in combinatorial auctions includes Goldberg and Hartline [119], Hartline and
Koltun [131], Guruswami et al. [129], Likhodedov and Sandholm [161], and Balcan et al. [37].

⁶ We make the assumption that all desired bundles contain at most one of each item. This assumption can be easily relaxed
and our results apply given any bound on the number of copies of each item that are desired by any one consumer. Of course,
this reduction produces an exponential blowup in the number of items.




summarized in the first row of Table 7.1.

Let $p = (p_1, \ldots, p_m)$ be an item-pricing of the $m$ items. Let $g_p$ be the offer corresponding to the pricing $p$.
The following is the main result of this section.

Theorem 7.6.1 Let $k$ be the size of the maximum desired bundle. Let $p'$ be the optimal discretized price
vector that uses item prices equal to 0 or powers of $(1+\epsilon)$ in the range $[\frac{h}{nk}, h]$ and let $p^*$ be the optimal
price vector. Then we have:

$$g_{p'}(S) \ge (1 - 2\sqrt{\epsilon})\, g_{p^*}(S).$$

Proof:  Let  5  =  ^/e.  For  the  optimal  price  vector  p*  with  item  j  priced  at  p*  (i.e.,  gp*{S)  =  OPTp), 
consider  a  price  vector  p  with  pj  in  [(1  —  S)p*,  (1  —  5  +  S‘^)p*]  if  p*  >  ^  and  0  otherwise,  where 
Pj  =  (1  +  e)^  for  some  integer  k  (note  that  such  a  price  vector  always  exists).  We  show  now  that 
gp{S)  >  (1  —  2y/e)gp*{S),  which  clearly  implies  the  desired  result. 

Let $J$ be a multi-set of items and $\mathrm{Profit}(J) = \sum_{j \in J} p^*_j$ be the payment necessary to purchase bundle
$J$ under pricing $p^*$. Define $R_j = p^*_j - p_j$. Thus we have:

(S  -  S^)p-  <  R,  <  max{ip-,  <  Sp-  + 

This implies that for any multiset $J$ with $|J| \le k$, we have the following upper and lower bounds:


$$\sum_{j \in J} R_j \ge (\delta - \delta^2)\,\mathrm{Profit}(J), \qquad (7.6.1)$$

E«j 

j&j' 

<  5Profil(j')  +  ^. 

(7.6.2) 

Let $J^*_i$ and $J_i$ be the bundles that bidder $i$ prefers under pricing $p^*$ and $p$, respectively. Consider
bidder $i$ who switches from bundle $J^*_i$ to bundle $J_i$ when the item prices are decreased from $p^*$ to $p$. This
implies that:


E  S  E  Rf 

jeJ*  jeJi 

Combining this with equations (7.6.1) and (7.6.2) and canceling a common factor of $\delta$ we see that:

(1  -  <5)Profi((j;)  <  Profi((Ji)  + 

Summing over all bidders $i$, we see that the total profit under our new pricing $p$ is at least
$(1-\delta)\,\mathrm{OPT}_{\mathcal{G}} - h\delta$. Since $\mathrm{OPT}_{\mathcal{G}} \ge h$, we finally obtain that the profit under $p$ is at least $(1 - 2\delta)\,\mathrm{OPT}_{\mathcal{G}}$. ■

Note that we can now apply Theorem 7.6.1 by letting $\mathcal{G}'$ be the offer class induced by the class of
item prices equal to 0 or powers of $(1+\epsilon)$ in the range $[\frac{h}{nk}, h]$ (where $k$ bounds the maximum size of a
bundle). Using Theorem 7.3.1 we obtain the following guarantee:

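Corollary 7.6.2 Given a $\beta$-approximation algorithm $A$ optimizing over $\mathcal{G}'$, then with probability at least
$1-\delta$, the profit of RSO$_{(\mathcal{G}',A)}$ is at least $(1 - 3\epsilon)\,\mathrm{OPT}_{\mathcal{G}}/\beta$ so long as

$$\mathrm{OPT}_{\mathcal{G}} \ge \frac{18h\beta}{\epsilon^2}\left(m \ln(\log_{1+\epsilon^2} nk) + \ln\frac{2}{\delta}\right).$$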




7.6.2  Bounds  via  Counting 

We now show how to use the technique of counting possible outcomes (see Section 7.3.3) to get a bound
on the performance of the random sampling auction with an algorithm $A$ for item-pricing. This approach
calls for bounding $|\mathcal{G}_A|$, the number of different pricing schemes RSO$_{(\mathcal{G},A)}$ can possibly output. Our
results for this approach are summarized in the second row of Table 7.1.

Recall that bidder $i$’s utility for a bundle $J$ given pricing $p$ is $u_i(J, p) = v_i(J) - \sum_{j \in J} p_j$. We now
make the following claim about the regions of the space of possible pricings, $\mathbb{R}^m$, in which bidder $i$’s
most desired bundle is fixed.

Claim 1 Let $P_i(J) = \{p \mid \forall J',\ u_i(J, p) \ge u_i(J', p)\}$. The set $P_i(J)$ is a polytope.

Proof:  This  follows  immediately  from  the  observation  that  the  region  Pi{J)  is  convex  and  the  only  way 
to  pack  convex  regions  into  space  is  if  they  are  polytopes. 

To show that $P_i(J)$ is convex, suppose the allocation to a particular bidder for $p$ and $p'$ are the same,
$J$. Then for any other bundle $J'$ we have:

$$v_i(J) - \sum_{j \in J} p_j \ge v_i(J') - \sum_{j \in J'} p_j$$

and

$$v_i(J) - \sum_{j \in J} p'_j \ge v_i(J') - \sum_{j \in J'} p'_j.$$

If we now consider any price vector $\alpha p + (1-\alpha) p'$, for $\alpha \in [0, 1]$, these imply:

$$v_i(J) - \sum_{j \in J} (\alpha p_j + (1-\alpha) p'_j) \ge v_i(J') - \sum_{j \in J'} (\alpha p_j + (1-\alpha) p'_j).$$

This  clearly  implies  that  this  agent  prefers  allocation  J  on  any  convex  combination  of  p  and  p'.  Hence 
the  region  of  prices  for  which  the  agent  prefers  bundle  J  is  convex.  ■ 
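The convexity argument can be checked on a toy instance. The sketch below is only an illustration: `valuation` is a made-up two-item valuation with a complementarity bonus, and `preferred_bundle` enumerates all $2^m$ bundles, so it is usable only for very small $m$. If a bidder prefers the same bundle under two price vectors, the same bundle should be preferred at every convex combination of them.

```python
from itertools import combinations


def preferred_bundle(valuation, prices):
    """Bundle maximizing v(J) minus the sum of item prices, by full enumeration."""
    m = len(prices)
    bundles = [frozenset(c) for r in range(m + 1) for c in combinations(range(m), r)]
    return max(bundles, key=lambda J: valuation(J) - sum(prices[j] for j in J))


def mix_prices(p, q, a):
    """Convex combination a*p + (1-a)*q of two price vectors."""
    return tuple(a * pj + (1 - a) * qj for pj, qj in zip(p, q))


def valuation(bundle):
    """Hypothetical valuation: item 0 worth 3, item 1 worth 2, plus 4 if both."""
    value = 0
    if 0 in bundle:
        value += 3
    if 1 in bundle:
        value += 2
    if 0 in bundle and 1 in bundle:
        value += 4
    return value
```

With prices $(1, 1)$ and $(4, 3)$ this bidder takes both items in each case, and does so at every price vector on the segment between them.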

The above claim shows that we can divide the space of pricings into polytopes based on an agent’s most
desirable bundle. Consider fixing an outcome, i.e., the bundles $J_1, \ldots, J_n$, obtained by agents $1, \ldots, n$,
respectively. This outcome occurs for pricings in the intersection $\bigcap_{i \in S} P_i(J_i)$.

Definition 7.6.1 For a set of agents $S$, let $\mathrm{Verts}_S$ denote the set of vertices of the polytopes that partition
the space of prices by the allocation produced. I.e., $\mathrm{Verts}_S = \{p$ such that $p$ is a vertex of the polytope
containing $\bigcap_{i \in S'} P_i(J_i)$ for some $S' \subseteq S$ and bundles $J_i\}$.

Claim 2 For $S' \subseteq S$ we have $\mathrm{Verts}_{S'} \subseteq \mathrm{Verts}_S$.

Proof: Follows immediately from the definition of $\mathrm{Verts}_S$ and basic properties of polytopes. ■

Now we consider optimal pricings. Note that when fixing an allocation $J_1, \ldots, J_n$ we are looking
for an optimal price point within the polytope that gives this allocation. Our objective function for this
optimization is linear. Let $n_j$ be the number of copies of item $j$ allocated by the allocation. The seller’s
payoff for prices $p = (p_1, \ldots, p_m)$ is $\sum_j n_j p_j$. Thus, all optimal pricings of this allocation lie on facets
of the polytope and in particular there is an optimal pricing that is at a vertex of the polytope. Over the
space of all possible allocations, all optimal pricings are on facets of the allocation defining polytopes and
there exists an optimal pricing that is at a vertex of one of the polytopes.

Lemma 7.6.3 Given an algorithm $A$ that always outputs a vertex of the polytope then $\mathcal{G}_A \subseteq \mathrm{Verts}_S$.




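Proof: This follows from the fact that RSO$_{(\mathcal{G}, \mathcal{A})}$ runs $\mathcal{A}$ on a subset $S'$ of $S$ which has $\mathrm{Verts}_{S'} \subseteq \mathrm{Verts}_S$.
$\mathcal{A}$ must pick a price vector from $\mathrm{Verts}_{S'}$. By Claim 2 this price vector must also be in $\mathrm{Verts}_S$. This gives
the lemma. ■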

We now discuss getting a bound on $|\mathrm{Verts}_S|$ for $n$ agents, $m$ distinct items, and various types of
preferences.

Theorem 7.6.4 We have the following upper bounds on $|\mathrm{Verts}_S|$:

1. $(n + m)^m$ for single-minded preferences.

2. $n^m (m + 1)^{2m}$ for unit-demand preferences.

3. $n^m 2^{2m^2}$ for arbitrary preferences.

Proof: We consider how many possible bundles, $M$, an agent might obtain as a function of the pricing.
An agent with single-minded preferences will always obtain one of $M_s = 2$ bundles: either their desired
bundle or nothing (the empty bundle). An agent with unit-demand preferences receives one of the $m$ items
or nothing for a total of $M_u = m + 1$ possible bundles. An agent with general preferences receives one
of the $M_g = 2^m$ possible bundles.^

We now bound the number of hyperplanes necessary to partition the pricing space into $M$ convex
regions (e.g., that specify which bundle the agent receives). For convex regions, each pair of regions can
meet in at most one hyperplane. Thus, the total number of hyperplanes necessary to partition the pricing
space into regions is at most $\binom{M}{2}$. Of course we wish to restrict our pricings to be non-negative, so we
must add $m$ additional hyperplanes at $p_j = 0$ for all $j$.

For all $n$ agents, we simply intersect the regions of all agents. This does not add any new hyperplanes.
Furthermore, we only need to count the $m$ hyperplanes that restrict to non-negative pricings once. Thus,
the total number of hyperplanes necessary for specifying the regions of allocation for $n$ agents with $M$
convex regions each is $K = n\binom{M}{2} + m$. Thus, $K_s = n + m$, $K_u \le n\binom{m+1}{2} + m \le n(m + 1)^2$, and
$K_g \le n\binom{2^m}{2} + m \le n 2^{2m}$ (for $m \ge 2$).

Of course, $K$ hyperplanes in $m$-dimensional space intersect in at most $\binom{K}{m} \le K^m$ vertices. Not all
of these intersections are vertices of polytopes defining our allocation; still, $K^m$ is an upper bound on the
size of $\mathrm{Verts}_S$. Plugging this in gives us the desired bounds of $(n + m)^m$, $n^m (m + 1)^{2m}$, and $n^m 2^{2m^2}$
respectively for single-minded, unit-demand, and general preferences. ■
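
As a quick sanity check of the counting argument, the following sketch (function names are illustrative, not from the text) computes the hyperplane count $K = n\binom{M}{2} + m$ for each preference type and compares $\binom{K}{m}$, $K^m$, and the closed-form bounds stated in Theorem 7.6.4.

```python
from math import comb


def hyperplane_count(n, m, bundles_per_agent):
    """K = n * C(M, 2) + m: pairwise region boundaries for each of the n
    agents, plus the m hyperplanes p_j = 0, which are counted only once."""
    return n * comb(bundles_per_agent, 2) + m


def vertex_bounds(n, m):
    """Return, per preference type, the tuple (K, C(K, m), K^m, closed form)."""
    bundles = {"single-minded": 2, "unit-demand": m + 1, "general": 2 ** m}
    closed = {
        "single-minded": (n + m) ** m,
        "unit-demand": n ** m * (m + 1) ** (2 * m),
        "general": n ** m * 2 ** (2 * m * m),
    }
    result = {}
    for kind, M in bundles.items():
        K = hyperplane_count(n, m, M)
        result[kind] = (K, comb(K, m), K ** m, closed[kind])
    return result


if __name__ == "__main__":
    for kind, row in vertex_bounds(n=10, m=3).items():
        print(kind, row)
```

Each row should be non-decreasing from its second entry onward, mirroring the chain $\binom{K}{m} \le K^m \le$ (closed-form bound) used in the proof.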

We note that our above arguments apply to approximation algorithms that always output a price
corresponding to the vertex of a polytope as well. Though we do not consider this direction here, it is entirely
possible that it is not computationally difficult to post-process the solution of an algorithm that is not a
vertex of a polytope to get a solution that is on a vertex of a polytope.^ This would further motivate the
analysis above. If, for some reason, restricting to algorithms that return vertices is undesirable, it is possible
to use cover arguments on the set of vertices we obtain when we add additional hyperplanes corresponding
to the discretization of the preceding section.

7.6.3  Combinatorial  Auctions:  Lower  Bounds 

We  show  in  the  following  an  interesting  lower  bound  for  combinatorial  auctions.^  Notice  that  our  upper 
bounds  and  this  lower  bound  are  quite  close. 

^Here we make the assumption that desired bundles are simple sets. If they are actually multi-sets with bounded multiplicity
$k$, then the agent could receive one of at most $M_g = (k + 1)^m$ bundles.

^Notice  that  this  is  not  immediate  because  of  the  complexity  of  representing  an  agent’s  combinatorial  valuation. 

^This proof follows the standard approach for lower bounds for revenue maximizing auctions that was first given by Goldberg
et al. in [123].




Theorem 7.6.5 Fix $m$ and $h$. There exists a probability distribution on unit-demand single-minded agents^
such that the expected revenue of any incentive compatible mechanism is at most $\frac{mh}{2}$ whereas the expected
revenue of OPT is at least $0.7mh$.

Thus, this theorem states that in order to achieve a close multiplicative ratio with respect to OPT, one
must have additive loss $\Omega(mh)$.

Proof: Consider the following probability distribution over valuations of agents' preferences. Assume
we have $n = \frac{mh}{2}$ agents in total, and $\frac{h}{2}$ agents desire item $j$ only, $j \in \{1, \ldots, m\}$. Each of these agents
has valuation $h$ with probability $\frac{1}{h}$ and valuation 1 with probability $1 - \frac{1}{h}$.

Notice now that any incentive-compatible mechanism has expected profit at most $n$. To see this, note that
for each bidder, any proposed price has expected profit (over the randomization in the selection of his
valuation) of at most 1. Moreover, the expected profit of OPT is at least $n + 0.2mh$. For each item $j$, there
is a $1 - (1 - \frac{1}{h})^{h/2} \approx 0.4$ probability that some bidder has valuation $h$. For those items, OPT gets at
least a profit of $h$. For the rest, OPT gets a profit of $\frac{h}{2}$. So, overall, OPT gets an expected profit of at
least $0.4mh + 0.6m(h/2) = 0.7mh$. All these together imply the desired result. ■
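
The arithmetic in this proof can be reproduced directly. The sketch below is a hedged illustration with hypothetical function names; it assumes $h$ is even, that $h/2$ bidders want each item, and that each bidder independently has valuation $h$ with probability $1/h$ and valuation 1 otherwise. It evaluates the exact probability $1 - (1 - 1/h)^{h/2}$ instead of the rounded value 0.4, so the resulting ratio to $mh$ is close to, but for large $h$ slightly below, 0.7 (it tends to roughly 0.697).

```python
def prob_some_high(h):
    """Probability that at least one of the h/2 bidders for an item has
    valuation h, when each does so independently with probability 1/h."""
    if h % 2 != 0:
        raise ValueError("this sketch assumes h is even")
    return 1.0 - (1.0 - 1.0 / h) ** (h // 2)


def per_bidder_revenue(price, h):
    """Expected revenue from offering a single bidder a fixed price."""
    if price <= 1:
        return price      # every bidder has valuation at least 1
    if price <= h:
        return price / h  # only a valuation-h bidder accepts
    return 0.0


def mechanism_upper_bound(m, h):
    """n times the best per-bidder expected revenue, with n = m*h/2 bidders."""
    n = m * h // 2
    return n * max(per_bidder_revenue(p, h) for p in (1, h))


def opt_lower_bound(m, h):
    """Per item: profit h if some bidder is high, else h/2 from price 1."""
    q = prob_some_high(h)
    return m * (q * h + (1.0 - q) * (h / 2))


if __name__ == "__main__":
    m = 3
    for h in (10, 100, 1000):
        ratio = opt_lower_bound(m, h) / (m * h)
        print(h, mechanism_upper_bound(m, h), round(ratio, 4))
```

The gap between `opt_lower_bound` and `mechanism_upper_bound` grows linearly in $mh$, which is the additive loss the theorem refers to.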

7.6.4  Algorithms  for  Item-pricing 

Given standard complexity assumptions, most item-pricing problems are not polynomial time solvable,
even for simple special cases. We review these results here. We focus our attention on the unlimited
supply special case, though some of the work we mention also considers limited supply item-pricing.
Algorithmic pricing problems in this form were first posed by Guruswami et al. [129], though item-pricing
for unit-demand consumers with several alternative payment rules (i.e., rules that do not represent
quasi-linear utility maximization) was independently considered by Aggarwal et al. [10].

For consumers with single-minded preferences, [129] gives a simple $O(\log mn)$ approximation
algorithm. Demaine et al. [101] show the problem to be hard to approximate to better than a $\log^{\delta}(m)$ factor for
some $\delta > 0$. Both Briest and Krysta [73] and Grigoriev et al. [126] proved that optimal pricing is weakly
NP-hard for the special case known as "the highway problem" where there is a linear order on the items
and all desired bundles are for sets of consecutive items (actually this hardness result follows for the more
specific case where the desired bundles for any two agents, $S_i$ and $S_{i'}$, satisfy one of: $S_i \subseteq S_{i'}$, $S_{i'} \subseteq S_i$,
or $S_i \cap S_{i'} = \emptyset$). In the case when the cardinality of the desired bundles is bounded by $k$, Briest and
Krysta [73] give an $O(k^2)$ approximation algorithm. In our work [24] we have improved this, by giving a
simpler and better $O(k)$ approximation. Finally, when the number of distinct items for sale, $m$, is constant,
Hartline and Koltun [131] show that it is possible to improve on the trivial $O(n^m)$ algorithm by giving
a near-linear time approximation scheme. Their approximation algorithm is actually an exact algorithm
for the problem of optimizing over a discretized set of item prices $\mathcal{G}'$, which is directly applicable to our
auction RSO$_{(\mathcal{G}', \mathcal{A})}$, discussed above.

For consumers with unit-demand preferences, [129] (and [10] essentially) give a trivial logarithmic
approximation algorithm and show that the optimization problem is APX-hard (meaning that standard
complexity assumptions imply that there does not exist a polynomial time approximation scheme (PTAS)
for the problem). Again, Hartline and Koltun [131] show how to improve on the trivial $O(n^m)$
algorithm in the case where the number of distinct items for sale, $m$, is constant. They give a near-linear time
approximation scheme that is based on considering a discretized set of item prices; however, the
discretization of Nisan [178] that we discussed above gives a significant improvement on their algorithm and also
generalizes it to be applicable to the problem of item-pricing for consumers with general combinatorial
preferences.
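
To make "optimizing over a discretized set of item prices" concrete, here is a minimal exhaustive-search sketch for unit-demand consumers. It is not the algorithm of Hartline and Koltun [131] and does not reproduce the discretization of Nisan [178]; the grid of powers of $(1 + \epsilon)$, the tie-breaking rule (ties go to the higher-priced item), and all names are assumptions made for illustration. Its running time is the grid size raised to the power $m$, so it is only sensible for constant $m$.

```python
from itertools import product


def price_grid(h, eps):
    """Assumed discretization: powers of (1 + eps) in [1, h), plus h itself."""
    grid, p = [], 1.0
    while p < h:
        grid.append(p)
        p *= 1.0 + eps
    grid.append(float(h))
    return grid


def unit_demand_revenue(prices, valuations):
    """Each consumer buys the single item maximizing valuation minus price,
    provided that utility is non-negative (assumption: ties are broken in
    favor of the higher price)."""
    total = 0.0
    for vals in valuations:
        options = [(v - p, p) for v, p in zip(vals, prices) if v - p >= 0]
        if options:
            total += max(options)[1]
    return total


def best_grid_pricing(valuations, grid):
    """Try every price vector in grid^m and keep the most profitable one."""
    m = len(valuations[0])
    best_prices, best_rev = None, -1.0
    for prices in product(grid, repeat=m):
        rev = unit_demand_revenue(prices, valuations)
        if rev > best_rev:
            best_prices, best_rev = prices, rev
    return best_prices, best_rev


if __name__ == "__main__":
    # Hypothetical consumers: one valuation per item (m = 2).
    consumers = [(4, 1), (1, 3), (4, 3)]
    print(best_grid_pricing(consumers, price_grid(h=4, eps=1.0)))
    print(best_grid_pricing(consumers, price_grid(h=4, eps=0.1)))
```

A coarser grid is faster but can miss good price vectors, which is the trade-off any discretization has to control.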

^Notice that these preferences are both unit-demand and single-minded.




7.7  Conclusions  and  Discussion 


In  this  work  we  have  made  an  explicit  connection  between  machine  learning  and  mechanism  design.  In 
doing  so,  we  obtain  a  unified  approach  to  considering  a  variety  of  profit  maximizing  mechanism  design 
problems  including  many  that  have  been  previously  considered  in  the  literature. 

Some  of  our  techniques  give  suggestions  for  the  design  of  mechanisms  and  others  for  their  analysis. 
In  terms  of  design,  these  include  the  use  of  discretization  to  produce  smaller  function  classes,  and  the  use 
of  structural-risk-minimization  to  choose  an  appropriate  level  of  complexity  of  the  mechanism  for  a  given 
set  of  bidders.  In  terms  of  analysis,  these  include  both  the  use  of  basic  sample-complexity  arguments,  and 
the  notion  of  multiplicative  covers  for  better  bounding  the  true  complexity  of  a  given  class  of  offers. 

Our results substantially generalize the previous work on random sampling mechanisms by both
broadening the applicability of such mechanisms and by simplifying the analysis. Our bounds on
random sampling auctions for digital goods not only show how the auction profit approaches the optimal
profit, but also weaken the required assumptions of [121] by a constant factor. Similarly, for random
sampling auctions for multiple digital goods, our unified analysis gives a bound that weakens the assumptions
of [119] by a factor of more than $m$, the number of distinct items. This multiple digital good auction
problem is a special case of the more general unlimited supply combinatorial auction problem for which we
obtain the first positive worst-case results by showing that it is possible to approximate the optimal profit
with an incentive-compatible mechanism. Furthermore, unlike the case for combinatorial auctions for
social welfare maximization, our incentive-compatible mechanisms can be easily based on approximation
algorithms instead of exact ones.

We have also explored the attribute auction problem that was proposed in [60] for 1-dimensional
attributes in a much more general setting: the attribute values can be multi-dimensional and the target
pricing functions considered can be arbitrarily complex. We bound the performance of random sampling
auctions as a function of the complexity of the target pricing functions.

Our random sampling auctions assume the existence of exact or approximate pricing algorithms.
Solutions to these pricing problems have been proposed for several of our settings. In particular, optimal
item-pricings for combinatorial auctions in the single-minded and unit-demand special cases have been
considered in [24, 73, 129, 131]. On the other hand, for attribute auctions, many of the clustering and
market-segmenting pricing algorithms have yet to be considered at all.




Chapter  8 


Bibliography 

[1] http://www.kernel-machines.org/. 1.1.2, 3.1

[2] 6th Kernel Machines Workshop. NIPS, 2002. http://www-stat.ucdavis.edu/~nello/nips02.html. 1.1.2

[3] The Seventh Workshop on Kernel Machines. COLT, 2003. http://learningtheory.org/colt2003. 1.1.2

[4] Workshop Graphical Models and Kernels. NIPS, 2004. http://users.rsise.anu.edu.au/~smola/workshops/nips04.
1.1.2

[5]  Kernel  Methods  and  Structured  Domains.  NIPS,  2005.  http://nips2005.kyb.tuebingen.mpg.de.  1.1.2 

[6]  1.1.1 

[7]  D.  Achlioptas.  Database-friendly  random  projections.  Journal  of  Computer  and  System  Sciences,  66(4): 
671-687,  2003.  6.1, 6.4 

[8] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005. 1.1.3, 4.1,
4.1.1

[9]  G.  Aggarwal  and  J.  Hartline.  Knapsack  Auctions.  In  Proceedings  of  the  17th  ACM-SIAM  Symposium  on 
Discrete  Algorithms,  2006.  7.1 , 7.5.3 

[10]  G.  Aggarwal,  T.  Feder,  R.  Motwani,  and  A.  Zhu.  Algorithms  for  multi-product  pricing.  In  Proceedings  of 
the  International  Colloquium  on  Automata,  Languages,  and  Programming,  pages  72-83,  2004.  7.6.4 

[11]  N.  Ailon,  M.  Charikar,  and  A.  Newman.  Aggregating  inconsistent  information:  ranking  and  clustering.  In 
STOC,  pages  684-693,  2005.  1 

[12]  P.  Alimonti  and  V.  Kann.  Hardness  of  approximating  problems  on  cubic  graphs.  In  Algorithms  and  Com¬ 
plexity,  1997.  4.8.1 

[13] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Computing,
26(6):1733-1748, 1997. 1, 4.1.1

[14]  N.  Alon,  W.  Fernandez  de  la  Vega,  R.  Kannan,  and  M.  Karpinski.  Random  sampling  and  approximation  of 
max-csps.  Journal  of  Computer  and  Systems  Sciences,  67(2):212-243,  2003.  1.1.3, 4,  4.1.2, 4.6,  4.6.4 

[15]  M.-R.  Amini,  O.  Chapelle,  and  R.  Ghani,  editors.  Learning  with  Partially  Classified  Training  Data.  Work¬ 
shop,  ICML’05,  2005.  2.1 

[16] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988. 5.1.1, 1

[17] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175-194, 2004. 5.1.1, 1

[18]  M.  Anthony  and  P.  Bartlett.  Neural  Network  Learning:  Theoretical  Foundations.  Cambridge  University 
Press,  1999.  1.1.2,  1.1.4,  3.1 , 3.2,  3.3.3,  6,  7.1 , 7.3.3,  7.3.3 , 7.5 

[19] S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In ACM Symposium on Theory of
Computing, 2005. 1.1.3, 4.1, 4.1.1




[20]  S.  Arora,  L.  Babai,  J.  Stern,  and  Z.  Sweedyk.  The  hardness  of  approximate  optima  in  lattices,  codes,  and 
systems  of  linear  equations.  Journal  of  Computer  and  System  Sciences,  54:317  -  331,  1997.  3.2,  3.3.3 

[21]  R.  I.  Arriaga  and  S.  Vempala.  An  algorithmic  theory  of  learning,  robust  concepts  and  random  projection, 
pages  616-623,  1999.  6.1  6.4,  6.4.1 

[22]  B.  Awerbuch,  Y.  Azar,  and  A.  Meyerson.  Reducing  truth-telling  online  mechanisms  to  online  optimization. 
In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, 2003. 7.1

[23]  M.-F.  Balcan  and  A.  Blum.  A  PAC-style  model  for  learning  from  labeled  and  unlabeled  data.  In  Proceedings 
of  the  Annual  Conference  on  Computational  Learning  Theory,  2005.  1.2 

[24]  M.-F.  Balcan  and  A.  Blum.  On  a  theory  of  learning  with  similarity  functions.  In  International  Conference  on 
Machine  Learning,  2006.  1.2  ,  3.1, 3.3, 4,  3.3.3,  3.4,  3.4.2,  3.4.4,  9  ,  3.4.6,  3.6, 4.1, 4.1.3, 4.2  ,  4.4,  4.4,  7.6.4, 
7.7 

[25]  M.-F.  Balcan  and  A.  Blum.  Approximation  Algorithms  and  Online  Mechanisms  for  Item  Pricing.  TOC,  2007. 
1.1.4,  1.2 

[26]  M.-F.  Balcan  and  A.  Blum.  Approximation  Algorithms  and  Online  Mechanisms  for  Item  Pricing.  In  Pro¬ 
ceedings  of  the  7th  ACM  Conference  on  Electronic  Commerce,  2006.  1.1.4  ,  1.2 

[27] M.-F. Balcan and A. Blum. An augmented PAC-model for semi-supervised learning. Book chapter in
"Semi-Supervised Learning", O. Chapelle, B. Schölkopf, and A. Zien, eds., MIT Press, 2006. 1, 1.2

[28]  M.  F.  Balcan,  A.  Blum,  and  K.  Yang.  Co-training  and  expansion:  Towards  bridging  theory  and  practice.  In 
Advances  in  Neural  Information  Processing  Systems,  2004.  1.2  ,  2.3.2,  2.4.2 

[29]  M.-F.  Balcan,  A.  Blum,  J.  Hartline,  and  Y.  Mansour.  Mechanism  Design  via  Machine  Learning.  In  46th 
Annual  IEEE  Symposium  on  Foundations  of  Computer  Science,  2005.  1.2 

[30]  M.-F.  Balcan,  A.  Beygelzimer,  and  J.  Langford.  Agnostic  active  learning.  In  International  Conference  on 
Machine  Learning,  2006.  1.2,  2.5.4,  7,5,2,  1 , 5.2.1 , 5.2.1 , 5.2.2,  5.2.2 

[31]  M.-F.  Balcan,  A.  Blum,  and  S.  Vempala.  On  kernels,  margins  and  low-dimensional  mappings.  Machine 
Learning  Journal,  2006.  1.2,  3.3 , 3.6, 4.1 , 4.1.3, 4.2,  4.4 

[32] M.-F. Balcan, A. Blum, H. Chan, and M.T. Hajiaghayi. A theory of loss-leaders: Making money by pricing
below  cost.  In  Proc.  3rd  International  Workshop  on  Internet  and  Network  Economics.  Lecture  Notes  in 
Computer  Science,  2007.  1.1.4,  1.2 

[33]  M.-F.  Balcan,  A.  Broder,  and  T.  Zhang.  Margin  based  active  learning.  In  Proceedings  of  the  20th  Annual 
Conference on Computational Learning Theory (COLT), 2007. 1.2, 2.5.4, 5, 5.1.1, 5.1.1, 5.1.4, 5.1.4, 5.2.2,
5.2.3 

[34] M.-F. Balcan, E. Even-Dar, S. Hanneke, M. Kearns, Y. Mansour, and J. Wortman. Asymptotic active learning.
In  Workshop  on  Principles  of  Learning  Design  Problem.  In  conjunction  with  the  21st  Annual  Conference  on 
Neural  Information  Processing  Systems  (NIPS),  2007.  2.5.4,  5,3, 5.3 

[35] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System
Sciences, 2008. 1.2, 5, 5.2.2

[36]  M.-F.  Balcan,  A.  Blum,  J.  Hartline,  and  Y.  Mansour.  Reducing  mechanism  design  to  algorithm  design  via 
machine  learning.  Journal  of  Computer  and  System  Sciences,  2008.  to  appear.  1.2 

[37]  M.-F.  Balcan,  A.  Blum,  and  Y.  Mansour.  Item  Pricing  for  Revenue  Maximization.  In  Proceedings  of  the  9th 
ACM  Conference  on  Electronic  Commerce,  2008.  1.2  5 

[38]  M.-F.  Balcan,  A.  Blum,  and  N.  Srebro.  A  theory  of  learning  with  similarity  functions.  Machine  Learning 
Journal,  2008.  1.2  ,  3.1, 3.3, 4,  3.4.4 

[39]  M.-F.  Balcan,  A.  Blum,  and  N.  Srebro.  Improved  guarantees  for  learning  via  similarity  functions.  In  COLT, 
2008.  1.2,  3.1,4 

[40]  M.-F.  Balcan,  A.  Blum,  and  S.  Vempala.  A  discriminative  framework  for  clustering  via  similarity  functions. 
In  Proceedings  of  the  40th  ACM  Symposium  on  Theory  of  Computing,  2008.  1.2,  3.6 




[41]  M.-F.  Balcan,  S.  Hanneke,  and  J.  Wortman.  The  true  sample  complexity  of  active  learning.  In  Proceedings 
of  the  21st  Annual  Conference  on  Computational  Learning  Theory  (COLT),  2008.  1.2,  2.5.4,  5,3, 5.3 

[42]  M.-F.  Balcan,  A.  Blum,  and  A.  Gupta.  Approximate  clustering  without  the  approximation.  In  ACM-SIAM 
Symposium  on  Discrete  Algorithms  (SODA),  2009.  1.2,  4.7,  4.7 

[43] P. Bartlett and S. Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.
Journal of Machine Learning Research, 54(3):463-482, 2002. 2.1.1, 2.3.1, 2.5.4, 3.2

[44] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern
classifiers. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999. 6.1, 6.3, 6.4.1

[45]  P.  Bartlett,  S.  Boucheron,  and  G.  Lugosi.  Model  selection  and  error  estimation.  In  Proceedings  of  the  13th 
Annual  Conference  on  Computational  Learning  Theory.  2.3.1 

[46]  E.  Baum  and  K.  Lang.  Query  learning  can  work  poorly  when  a  human  oracle  is  used.  In  International  Joint 
Conference  on  Neural  Networks,  1993.  5.1.1 

[47]  E.  B.  Baum.  Polynomial  time  algorithms  for  learning  neural  nets.  In  Proceedings  of  the  third  annual  work¬ 
shop  on  Computational  learning  theory,  pages  258  -  272,  1990.  2.1 

[48]  S.  Ben-David.  A  priori  generalization  bounds  for  kernel  based  learning.  In  NIPS  Workshop  on  Kernel  Based 
Learning,  pages  991  -  998,  2001.  6.1 

[49]  S.  Ben-David.  A  framework  for  statistical  clustering  with  constant  time  approximation  for  k-means  cluster¬ 
ing.  Machine  Learning  Journal,  2007.  4.1.3 

[50]  S.  Ben-David,  N.  Eiron,  and  H.-U.  Simon.  Limitations  of  learning  via  embeddings  in  euclidean  half-spaces. 
The  Journal  of  Machine  Learning  Research,  3:441  -  461,  2003.  3.4,  3.4.3,  6.1 

[51]  A.  Ben-Israel  and  T.N.E.  Greville.  Generalized  Inverses:  Theory  and  Applications .  Wiley,  New  York,  1974. 
6.3 

[52]  G.M.  Benedek  and  A.  Itai.  Learnability  by  fixed  distributions.  In  Proc.  1st  Workshop  Computat.  Learning 
Theory,  pages  80-90,  1988.  3.4.3 

[53]  G.M.  Benedek  and  A.  Itai.  Learnability  with  respect  to  a  fixed  distribution.  Theoretical  Computer  Science, 
86:377-389,  1991.  2.1 , 2.1.1 , 2.5.4 

[54] K. P. Bennett and C. Campbell. Support vector machines: hype or hallelujah? SIGKDD Explor. Newsl., 2(2):
1-13, 2000. 3.4, 3.4.2

[55]  T.  De  Bie  and  N.  Cristianini.  Convex  methods  for  transduction.  In  Proceedings  of  the  Seventeenth  Annual 
Conference  on  Neural  Information  Processing  Systems,  volume  16,  2003.  2.2 

[56]  T.  De  Bie  and  N.  Cristianini.  Convex  transduction  with  the  normalized  cut.  Internal  Report  04-128,  ESAT- 
SISTA,  K.U.Leuven,  2004.  2.5.1 

[57] C.L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases. 1998.
http://www.ics.uci.edu/~mlearn/MLRepository.html. 7.5.1

[58]  A.  Blum.  Machine  learning  theory.  Essay,  2007.  1.1.1 

[59] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. ICML,
pages 19-26, 2001. 1.1.1, 2.2, 2.3.2, 2.5.1

[60]  A.  Blum  and  J.  Hartline.  Near-Optimal  Online  Auctions.  In  Proceedings  of  the  16th  ACM-SIAM  Symposium 
on  Discrete  Algorithms,  pages  1156  -  1163,  2005.  7.1 , 7.3.1 , 7.5.1 , 7.5.1 , 7.5.1 , 7.5.3, 7.7 

[61]  A.  Blum  and  R.  Kannan.  Learning  an  intersection  of  k  halfspaces  over  a  uniform  distribution.  Journal  of 
Computer  and  Systems  Sciences,  54(2):371-380,  1997.  2.1 

[62]  A.  Blum  and  T.  M.  Mitchell.  Combining  labeled  and  unlabeled  data  with  co-training.  In  COLT,  1998.  1.1.1 
2.1 , 2.1.2,  2.2,  2.3.2,  2.4.2, 2.4.2,  1 , 2.4.2,  2.5.4 

[63] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and
characterizing statistical query learning using fourier analysis. In Proceedings of the 26th Annual ACM Symposium on
Theory of Computing, pages 253-262, 1994. 3.4.3

[64]  A.  Blum,  A.  Frieze,  R.  Kannan,  and  S.  Vempala.  A  polynomial-time  algorithm  for  learning  noisy  linear 
threshold  functions.  Algorithmica,  22:35-52,  1998.  2.1.2,  2.4.2,  2.4.2,  2.4.2,  2.4.2 

[65]  A.  Blum,  V.  Kumar,  A.  Rudra,  and  F.  Wu.  Online  Learning  in  Online  Auctions.  In  Proceedings  of  the  14th 
ACM-SIAM  Symposium  on  Discrete  Algorithms,  pages  137  -  146,  2003.  7.1 

[66]  A.  Blum,  N.  Bansal,  and  S.  Chawla.  Correlation  clustering.  Machine  Learning,  56:89-113,  2004.  1 

[67] A. Blum, J. Lafferty, R. Reddy, and M. R. Rwebangira. Semi-supervised learning using randomized mincuts.
In ICML '04, 2004. 2.3.2, 2.5.1

[68]  A.  Blumer,  A.  Ehrenfeucht,  D.  Haussler,  and  M.  Warmuth.  Occam’s  razor.  Information  Processing  Letters, 
24:377-380,  1987.  5.1.2,  5.1.3 

[69]  A.  Blumer,  A.  Ehrenfeucht,  D.  Haussler,  and  M.  K.  Warmuth.  Learnability  and  the  Vapnik  Chervonenkis 
dimension.  Journal  of  the  ACM,  36(4):929-965,  1989.  2.1 , 2.1.1 , 7.1 , 7.5 

[70] C. Borgs, J. T. Chayes, N. Immorlica, M. Mahdian, and A. Saberi. Multi-unit auctions with
budget-constrained bidders. In Proceedings of the 6th ACM Conference on Electronic Commerce, pages 44-51,
2005. 1.1.4, 7.2.4

[71] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random
Structures and Algorithms, 16:277-292, 2000. 2.3.1, A.1.1

[72] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of Classification: A Survey of Recent Advances. ESAIM:
Probability  and  Statistics,  2005.  1.1.3 , 2.1.1 , 2.3.1 , 2.3.1, 4.1.2,  4.2,  5.1.2,  7.3.3 

[73]  P.  Briest  and  P.  Krysta.  Single-Minded  Unlimited  Supply  Pricing  on  Sparse  Instances.  In  Proceedings  of  the 
17th  ACM-SIAM  Symposium  on  Discrete  Algorithms,  2006.  7.6.4,  7.7 

[74]  N.  H  Bshouty  and  N.  Eiron.  Learning  monotone  dnf  from  a  teacher  that  almost  does  not  answer  membership 
queries.  Journal  of  Machine  Learning  Research,  3:49-57,  2002.  5.1.1 ,  1 

[75] M. Burnashev and K. Zigangirov. An interval estimation problem for controlled observations. Problems in
Information  Transmission,  10:223-231,  1974.  5.1.1, 5.1.4 

[76]  V.  Castelli  and  T.M.  Cover.  On  the  exponential  value  of  labeled  samples.  Pattern  Recognition  Letters,  16: 
105-111,  1995.  2.1,  1, 2.5.2 

[77] V. Castelli and T.M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with
an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102-2117, 1996. 2.1, 1,
2.5.2

[78]  R.  Castro  and  R.  Nowak.  Minimax  bounds  for  active  learning.  In  Proceedings  of  the  20th  Annual  Conference 
on Computational Learning Theory (COLT), 2007. 5.1.1, 5.1.1

[79] R. Castro, R. Willett, and R. Nowak. Faster rates in regression via active learning. In Advances in Neural
Information  Processing  Systems,  volume  18,  2006.  5.1.1, 5.1.1 

[80]  Rui  M.  Castro  and  Robert  D.  Nowak.  Upper  and  lower  error  bounds  for  active  learning.  In  The  44th  Annual 
Allerton  Conference  on  Communication,  Control  and  Computing,  2006.  2  ,  5.2.2,  5.2.2,  5.2.2 

[81]  O.  Chapelle  and  A.  Zien.  Semi-supervised  classification  by  low  density  separation.  In  Tenth  International 
Workshop  on  Artificial  Intelligence  and  Statistics,  2005.  2.2 

[82] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA,
2006. URL http://www.kyb.tuebingen.mpg.de/ssl-book. 2.1, 2.3.1, 2.4, 2.6.1, 3.4.1

[83] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the
k-median problem. In ACM Symposium on Theory of Computing, 1999. 1.1.3, 4.1, 4.1.1

[84] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. In Proceedings of the
44th Annual Symposium on Foundations of Computer Science, pages 524-533, 2003. 1

[85] D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. ACM
Trans. Database Syst., 31(4):1499-1525, 2006. 4.9

[86] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):
201-221, 1994. 1.1.1, 2.5.4, 7, 5, 5.1.1, 5.1.3, 5.2, 1, 5.2.1, 5.2.1

[87]  M.  Collins  and  Y.  Singer.  Unsupervised  models  for  named  entity  classification.  In  Proceedings  of  the  Joint 
SIGDAT  Conference  on  Empirical  Methods  in  Natural  Language  Processing  and  Very  Large  Corpora,  pages 
189-196,  1999.  2.2 

[88]  C.  Cortes  and  V.  Vapnik.  Support-vector  networks.  Machine  Learning,  20(3):273  -  297,  1995.  1.1.2,  3.1 

[89]  N.  Cristianini,  J.  Shawe-Taylor,  Andre  Elisseeff,  and  J.  Kandola.  On  kernel  target  alignment.  In  Advances  in 
Neural  Information  Processing  Systems,  2001.  1.1.2  ,  3.1 

[90]  J.  Czyzowicz,  D.  Mundici,  and  A.  Pelc.  Ulam’s  searching  game  with  lies.  Journal  of  Combinatorial  Theory, 
Series  A,  52:62-76,  1989.  5.1.1 

[91]  A.  Dasgupta,  J.  Hopcroft,  J.  Kleinberg,  and  M.  Sandler.  On  learning  mixtures  of  heavy-tailed  distributions. 
In  46th  IEEE  Symposium  on  Foundations  of  Computer  Science,  2005.  1.1.3, 4.1 

[92] A. Dasgupta, J. E. Hopcroft, R. Kannan, and P. P. Mitra. Spectral clustering by recursive partitioning. In
ESA,  pages  256-267,  2006.  1 , 4.1.1 

[93]  S.  Dasgupta.  Analysis  of  a  greedy  active  learning  strategy.  In  Advances  in  Neural  Information  Processing 
Systems,  2004.  5.1.1 

[94]  S.  Dasgupta.  Coarse  sample  complexity  bounds  for  active  learning.  In  Proceedings  of  the  Nineteenth  Annual 
Conference on Neural Information Processing Systems, 2005. 1.1.1, 2.5.4, 5, 5.1.1, 5.1.1, 5.1.4, 5.3

[95]  S.  Dasgupta.  Learning  mixtures  of  gaussians.  In  Fortieth  Annual  IEEE  Symposium  on  Foundations  of  Com¬ 
puter  Science,  1999.  1.1.3, 4.1, 4.1.1 

[96]  S.  Dasgupta  and  A.  Gupta.  An  elementary  proof  of  the  Johnson-Lindenstrauss  Lemma.  Random  Structures 
& Algorithms, 22(1):60-65, 2002. 6, 6.1, 6.4

[97] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In Advances in
Neural  Information  Processing  Systems  14,  2001.  2.3.2 

[98]  S.  Dasgupta,  A.  Kalai,  and  C.  Monteleoni.  Analysis  of  perceptron-based  active  learning.  In  Proceedings  of 
the  Eighteenth  Annual  Conference  on  Learning  Theory,  2005.  5, 5.1.1, 5.1.1 , 5.1.4,  5.1.4,  6,  1 , 5.2.1 

[99]  S.  Dasgupta,  D.J.  Hsu,  and  C.  Monteleoni.  A  general  agnostic  active  learning  algorithm.  Advances  in  Neural 
Information  Processing  Systems,  20,  2007.  2  ,  5.1.5, 5.1.6,  5.2 

[100] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for
clustering problems. In 2003. 4.1.1

[101] E. Demaine, U. Feige, M.T. Hajiaghayi, and M. Salavatipour. Combination Can Be Hard: Approximability of
the Unique Coverage Problem. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms,
2006. 7.5.3, 7.6.4

[102] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer-Verlag, 2001. 7.5.1

[103] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
1.1.3, 2.1.1, 2.3.1, 4.1.1, 4.1.2, 4.2, 4.6, 7.3.3, 7.5, A.1.1, A.1.1, A.1.1, A.3.1

[104]  J.  Dunagan  and  S.  Vempala.  Optimal  outlier  removal  in  high-dimensional  spaces.  In  Proceedings  of  the  33rd 
ACM  Symposium  on  Theory  of  Computing,  2001.  2.4.2  ,  2.4.2 

[105]  A.  Ehrenfeucht,  D.  Haussler,  M.  Kearns,  and  L.  Valiant.  A  general  lower  bound  on  the  number  of  examples 
needed for learning. Inf. and Comput., 82:246-261, 1989. 2.4.1

[106] A. Fiat, A. Goldberg, J. Hartline, and A. Karlin. Competitive Generalized Auctions. In Proceedings 34th
ACM Symposium on the Theory of Computing, pages 72-81, 2002. 7.1

[107] P. Fischer and S. Kwek. Minimizing Disagreement for Geometric Regions Using Dynamic Programming,
with  Applications  to  Machine  Learning  and  Computer  Graphics.  1996.  7.5.1 




[108]  A.  Flaxman.  Personal  communication,  2003.  2.4.2 

[109] J. Forster and H.-U. Simon. On the smallest possible dimension and the largest possible margin of linear
arrangements representing given concept classes. Theoretical Computer Science, 350(1):40-48, 2006. 3.4,
3.4.3

[110]  Y.  Freund  and  R.  E.  Schapire.  Large  margin  classification  using  the  perceptron  algorithm.  Machine  Learning, 
37(3):277  -  296,  1999.  3.1, 3.3.4 

[111]  Y.  Freund  and  R.E.  Schapire.  Large  margin  classification  using  the  Perceptron  algorithm.  Machine  Learning, 
37(3):277-296,  1999.  6.4.1 

[112] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm.
Machine Learning, 28(2-3):133-168, 1997. 5, 5.1.1, 1

[113] A. Frieze and R. Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175-220,
1999. 1.1.3, 4, 4.1.2, 4.6

[114]  K.  Ganchev,  J.  Graca,  J.  Blitzer,  and  B.  Taskar.  Multi-view  learning  over  structured  and  non-identical  outputs. 
In  Proceedings  of  The  24th  Conference  on  Uncertainty  in  Artificial  Intelligence,  2008.  2.6.1 

[115]  C.  Gentile  and  M.  K.  Warmuth.  Linear  hinge  loss  and  average  margin.  In  Proceedings  of  the  1998  conference 
on  Advances  in  neural  information  processing  systems,  1988.  3.4.2 

[116]  R.  Ghani.  Combining  labeled  and  unlabeled  data  for  text  classification  with  a  large  number  of  categories.  In 
Proceedings  of  the  IEEE  International  Conference  on  Data  Mining,  2001.  2.2 

[117]  R.  Ghani,  R.  Jones,  and  C.  Rosenberg,  editors.  The  Continuum  from  Labeled  to  Unlabeled  Data  in  Machine 
Learning  and  Data  Mining.  Workshop  ICML’03,  2003.  2.1 

[118] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation,
10(6):1455-1480, 1998. 3.4.2

[119]  A.  Goldberg  and  J.  Hartline.  Competitive  Auctions  for  Multiple  Digital  Goods.  In  Proceedings  of  the  9th 
Annual  European  Symposium  on  Algorithms,  pages  416  -  427,  2001.  7.1 , 7.6,  5 , 7.7 

[120] A. Goldberg and J. Hartline. Competitiveness via Consensus. In Proceedings of the 14th ACM-SIAM
Symposium on Discrete Algorithms, pages 215-222, 2003. 7.3

[121]  A.  Goldberg,  J.  Hartline,  and  A.  Wright.  Competitive  Auctions  and  Digital  Goods.  In  Proceeding  of  the  12th 
ACM-SIAM  Symposium  on  Discrete  Algorithms,  pages  735-744,  2001.  1.1.4,  7.1 , 7.3, 7.3, 4,  7.7 

[122]  A.  Goldberg,  J.  Hartline,  A.  Karlin,  M.  Saks,  and  A.  Wright.  Competitive  Auctions  and  Digital  Goods. 
Games  and  Economic  Behavior,  2002.  Submitted  for  publication.  An  earlier  version  available  as  InterTrust 
Technical  Report  STAR-TR-99.09.01.  1 

[123] A. Goldberg, J. Hartline, A. Karlin, and M. Saks. A Lower Bound on the Competitive Ratio of Truthful
Auctions. In Proceedings 21st Symposium on Theoretical Aspects of Computer Science, pages 644-655,
2004. 9

[124]  A.  Goldberg,  J.  Hartline,  A.  Karlin,  M.  Saks,  and  A.  Wright.  Competitive  Auctions  and  Digital  Goods. 
Games  and  Economic  Behavior,  2006.  1.1.4,  7.1 , 7.3 , 7.4 

[125]  O.  Goldreich,  S.  Goldwasser,  and  S.  Micali.  How  to  construct  random  functions.  Journal  of  the  ACM,  33(4): 
792-807,  1986.  6.5 

[126] A. Grigoriev, J. van Loon, R. Sitters, and M. Uetz. How to Sell a Graph: Guidelines for Graph Retailers.
Meteor  Research  Memorandum  RM/06/001,  Maastricht  University,  2005.  7.6.4 

[127]  V.  Guigue,  A.  Rakotomamonjy,  and  S.  Canu.  Kernel  basis  pursuit.  In  Proceedings  of  the  16th  European 
Conference  on  Machine  Learning  (ECML’05),  2005.  3.4,  3.4.2 

[128]  S.  R.  Gunn  and  J.  S.  Kandola.  Structural  modelling  with  sparse  kernels.  Mach.  Learn.,  48(1-3):137-163, 
2002.  ISSN  0885-6125.  3.4.2 

[129] V. Guruswami, J. Hartline, A. Karlin, D. Kempe, C. Kenyon, and F. McSherry. On Profit-Maximizing
Envy-Free Pricing. In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms, pages 1164-1173,
2005. 1.1.4, 7.1, 7.2.4, 5, 7.6.4, 7.7

[130]  S.  Hanneke.  A  bound  on  the  label  complexity  of  agnostic  active  learning.  In  Proceedings  of  the  24th  Annual 
International  Conference  on  Machine  Learning  (ICML),  2007.  5.1.5, 5.1.5, 5.1.5, 5.1.5,  5.1.6 

[131] J. Hartline and V. Koltun. Near-Optimal Pricing in Near-Linear Time. In Proceedings of the 9th Workshop
on Algorithms and Data Structures, pages 422-431, 2005. 7.6.1, 5, 7.6.4, 7.7

[132]  B.  Heisele,  P.  Ho,  and  T.  Poggio.  Face  recognition  with  support  vector  machines:  Global  versus  component- 
based  approach.  In  International  Conference  on  Computer  Vision,  2001.  1.1.2 

[133]  R.  Herbrich.  Learning  Kernel  Classifiers.  MIT  Press,  Cambridge,  2002.  1.1.2,  3.1 , 4.1 , 4.1.3, 4.2, 4.4 

[134] R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods, and applications. SIAM Rev.,
35(3):380-429, 1993. 3.3.5

[135] R. Hwa, M. Osborne, A. Sarkar, and M. Steedman. Corrected co-training for statistical parsers. In
ICML-03 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining,
Washington D.C., 2003. 1.1.1

[136] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In
Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604-613, 1998. 6.4

[137] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution.
Journal of Computer and System Sciences, 57(3):414-440, 1995. 5.1.1, 1

[138]  K.  Jain  and  V.  V.  Vazirani.  Approximation  algorithms  for  metric  facility  location  and  k-median  problems 
using  the  primal-dual  schema  and  lagrangian  relaxation.  JACM,  48(2):274  -  296,  2001.  1.1.3 , 4.1 , 4.1.1 

[139] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms.
Kluwer, 2002. 1.1.2, 3.1

[140]  T.  Joachims.  Transductive  learning  via  spectral  graph  partitioning.  In  Proceedings  of  the  International 
Conference  on  Machine  Learning  (ICML),  2003.  2.3.2,  2.5.1 

[141]  T.  Joachims.  Transductive  inference  for  text  classification  using  support  vector  machines.  In  Proc.  ICML, 
pages  200-209,  1999.  1.1.1 , 2.1 , 2.2,  2.5.4 

[142]  W.  B.  Johnson  and  J.  Lindenstrauss.  Extensions  of  Lipschitz  mappings  into  a  Hilbert  space.  In  Conference 
in  Modern  Analysis  and  Probability,  pages  189-206,  1984.  6.1, 6.4 

[143] M. Kääriäinen. On active learning in the non-realizable case. In ALT, 2006. 5.1.1, 5.1.4, 5.1.4

[144] M. Kääriäinen. Generalization error bounds using unlabeled data. In Proceedings of the 18th Annual
Conference on Learning Theory, pages 127-142, 2005. 2.1, 2.5.4

[145]  A.  Kalai,  A.  Klivans,  Y.  Mansour,  and  R.  Servedio.  Agnostically  learning  halfspaces.  In  Proceedings  of  the 
46th  Annual  Symposium  on  the  Foundations  of  Computer  Science,  2005.  3.3.3 , 5.2.3 

[146]  R.  Kannan,  S.  Vempala,  and  A.  Vetta.  On  clusterings:  good,  bad  and  spectral.  J.  ACM,  51(3):497-515,  2004. 
1.1.3, 4.1, 4.1.1 

[147]  R.  Kannan,  H.  Salmasian,  and  S.  Vempala.  The  spectral  method  for  general  mixture  models.  In  Proc.  18th 
Annual  Conference  on  Learning  Theory,  2005.  1.1.3, 4.1 , 4.1.1 

[148]  M.  Kearns.  Efficient  noise-tolerant  learning  from  statistical  queries.  In  Journal  of  the  ACM  (JACM),  pages 
983  -  1006,  1998.  6 

[149] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. 2.1, 2.1.1,
2.3.1, 2.4.2, 6, 3.3, 7.1, 7.5

[150]  J.  Kleinberg.  Detecting  a  network  failure.  In  Proceedings  of  the  41st  IEEE  Symposium  on  Foundations  of 
Computer  Science,  pages  231-239,  2000.  2.5.1 

[151] J. Kleinberg. An impossibility theorem for clustering. In NIPS, 2002. 1.1.3, 4.1

[152]  J.  Kleinberg,  M.  Sandler,  and  A.  Slivkins.  Network  failure  detection  and  graph  connectivity.  In  Proceedings 
of  the  41st  IEEE  Symposium  on  Foundations  of  Computer  Science,  pages  231-239,  2004.  2.5.1 




[153]  A.  R.  Klivans,  R.  O’Donnell,  and  R.  Servedio.  Learning  intersections  and  thresholds  of  halfspaces.  In 
Proceedings  of  the  43rd  Symposium  on  Foundations  of  Computer  Science,  pages  177-186,  2002.  2.1 

[154]  D.  E.  Knuth.  The  Art  of  Computer  Programming.  Addison-Wesley,  1997.  2 

[155] V. Koltchinskii. Rademacher Penalties and Structural Risk Minimization. IEEE Transactions on Information
Theory, 54(3):1902-1914, 2001. 2.1.1, 2.3.1, 2.5.4

[156]  R.  I.  Kondor  and  J.  Lafferty.  Diffusion  kernels  on  graphs  and  other  discrete  structures.  In  Proc.  ICML,  2002. 
4.1.3 

[157]  G.  R.  G.  Lanckriet,  N.  Cristianini,  P.  L.  Bartlett,  L.  El  Ghaoui,  and  M.  I.  Jordan.  Learning  the  kernel  matrix 
with  semidefinite  programming.  Journal  of  Machine  Learning  Research,  (5):27-72,  2004.  1.1.2,  3.1 , 3.3.4 

[158] B. Leskes. The value of agreement: A new boosting algorithm. In Proceedings of the Annual Conference on
Computational  Learning  Theory,  2005.  2.2,  2.5.4 

[159] A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In
Proc. 9th Int. Conf. Computer Vision, pages 626-633, 2003. 1.1.1, 2.1, 2.2

[160]  L.  Liao  and  W.  S.  Noble.  Combining  pairwise  sequence  similarity  and  support  vector  machines  for  detecting 
remote  protein  evolutionary  and  structural  relationships.  Journal  of  Computational  Biology,  10(6):857-868, 
2003.  3.2,  3.6 

[161]  A.  Likhodedov  and  T.  Sandholm.  Approximating  Revenue-Maximizing  Combinatorial  Auctions.  In  The 
Twentieth  National  Conference  on  Artificial  Intelligence  (AAAI),  pages  267-274,  2005.  5 

[162] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. In
Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, pages 574-579, Research
Triangle Park, North Carolina, October 1989. 2.1, 3.4.3

[163] N. Littlestone. From online to batch learning. In Proc. 2nd Annual ACM Conference on Computational
Learning  Theory,  pages  269-284,  1989.  3.4.2,  4.10 

[164]  N.  Littlestone.  Learning  when  irrelevant  attributes  abound:  A  new  linear-threshold  algorithm.  Machine 
Learning,  2:285-318,  1987.  6 

[165]  P.  M.  Long.  On  the  sample  complexity  of  PAC  learning  halfspaces  against  the  uniform  distribution.  IEEE 
Transactions  on  Neural  Networks,  6(6):  1556-1559,  1995.  5.1.4,  5.2.1 

[166]  R.  Luss  and  A.  d’Aspremont.  Support  vector  machine  classification  with  indefinite  kernels.  In  Advances  in 
Neural  Information  Processing  Systems,  2007.  3.2 

[167] Y. Mansour. Learning boolean functions via the Fourier transform. In Theoretical Advances in Neural
Computation and Learning, 391-424. 1994. 3.4.3

[168] D. McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of the 16th Conference on
Computational Learning Theory, 2003. 3.2

[169] F. McSherry. Spectral partitioning of random graphs. In Proc. 43rd Symp. Foundations of Computer Science,
pages 529-537, 2001. 1, 4.1.1

[170]  S.  Mendelson  and  P.  Philips.  Random  subclass  bounds.  In  Proceedings  of  the  16th  Annual  Conference  on 
Computational  Learning  Theory  (COLT),  2003.  2.5.3 

[171] R. Meyerson. Optimal Auction Design. Mathematics of Operations Research, 6:58-73, 1983. 1.1.4, 7.1

[172] T. Mitchell. The discipline of machine learning. CMU-ML-06-108, 2006. 1, 1.1.1, 3.4.1

[173] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning
algorithms. IEEE Transactions on Neural Networks, 12:181-201, 2001. 1.1.2, 3.1

[174] R. Myerson. Optimal Auction Design. Mathematics of Operations Research, 6:58-73, 1981. 1

[175]  K.  Nigam  and  R.  Ghani.  Analyzing  the  effectiveness  and  applicability  of  Co-training.  In  Proc.  ACM  CIKM 
Int.  Conf.  on  Information  and  Knowledge  Management,  pages  86-93,  2000.  2.2 

[176] K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled
documents using EM. Mach. Learning, 39(2/3):103-134, 2000. 1.1.1, 2.1

[177] N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani (Eds.). Algorithmic Game Theory. 2006. To appear.
1.1.4, 7.2.2

[178]  N.  Nissan.  Personal  communication,  2005.  7.6.1, 7.6.4 

[179] E. E. Osuna and F. Girosi. Reducing the run-time complexity in support vector machines. In Advances in
kernel methods: support vector learning, pages 271-283. 1999. ISBN 0-262-19416-3. 3.4.2

[180]  S.  Park  and  B.  Zhang.  Large  scale  unstructured  document  classification  using  unlabeled  data  and  syntactic 
information.  In  PAKDD  2003,  LNCS  vol.  2637,  pages  88-99.  Springer,  2003.  2.2 

[181] S.-B. Park and B.-T. Zhang. Co-trained support vector machines for large scale unstructured document
classification using unlabeled data and syntactic information. Information Processing and Management,
40(3):421-439, 2004. 1.1.1

[182]  D.  Pierce  and  C.  Cardie.  Limitations  of  Co-Training  for  natural  language  learning  from  large  datasets.  In 
Proc.  Conference  on  Empirical  Methods  in  NLP,  pages  1-9,  2001.  2.2 

[183]  J.  Ratsaby  and  S.  Venkatesh.  Learning  from  a  mixture  of  labeled  and  unlabeled  examples  with  parametric 
side  information.  In  Proceedings  of  the  Eighth  Annual  Conference  on  Computational  Learning  Theory,  pages 
412-417, 1995. 2.5.2

[184] D. Rosenberg and P. Bartlett. The Rademacher Complexity of Co-Regularized Kernel Classes. In Proceedings
of Artificial Intelligence & Statistics, 2007. 2.2, 2.5.4, 2.6.1

[185]  V.  Roth.  Sparse  kernel  regressors.  In  ICANN  ’01:  Proceedings  of  the  International  Conference  on  Artificial 
Neural  Networks,  2001.  3.4,  3.4.2 

[186] B. Schölkopf and A. J. Smola. Learning with kernels. Support Vector Machines, Regularization,
Optimization, and Beyond. MIT University Press, Cambridge, 2002. 3.2, 3.3

[187] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004. 1.1.2,
3.1, 4.1, 4.1.3, 4.2, 4.4

[188] R. E. Schapire. The strength of weak learnability. Machine Learning, (5):197-227, 1990. 3.3.2

[189] J. Shawe-Taylor. Rademacher Analysis and Multi-View Classification. 2006.
http://www.gla.ac.uk/external/RSS/RSScomp/shawe-taylor.pdf. 2.2, 2.6.1

[190]  J.  Shawe-Taylor  and  N.  Cristianini.  Kernel  Methods  for  Pattern  Analysis.  Cambridge  University  Press,  2004. 
1.1.2,  3.1, 4.1, 4.1.3, 4.2, 4.4 

[191] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over
data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. 1.1.2, 2.1.3, 2.5,
2.5.3, 3.1, 4.4, 6, 6.1, 6.3, 6.4.1

[192] Y. Singer. Leveraged vector machines. In Advances in Neural Information Processing Systems 12, 2000.
3.4, 3.4.2

[193]  A.  J.  Smola,  P.  Bartlett,  B.  Scholkopf,  and  D.  Schuurmans.  Advances  in  Large  Margin  Classifiers.  MIT  Press, 
2000.  1.1.2,  3.1 

[194] N. Sokolovska, O. Cappé, and F. Yvon. The asymptotics of semi-supervised learning in discriminative
probabilistic models. In Proceedings of the 25th International Conference on Machine Learning, 2008. 2.5.4

[195]  N.  Srebro.  How  good  is  a  kernel  as  a  similarity  function?  In  Proc.  20th  Annual  Conference  on  Learning 
Theory,  2007.  3.3,  3.4.1, 3.4.4,  3.4.4,  3.4.4,  4.4 

[196]  K.  Sridharan  and  S.  M.  Kakade.  An  information  theoretic  framework  for  multi-view  learning.  In  Proceedings 
of  the  21st  Annual  Conference  on  Learning  Theory,  2008.  2.2,  3,  2.6.1 

[197]  C.  Swamy.  Correlation  clustering:  Maximizing  agreements  via  semidefinite  programming.  In  SODA,  2004. 
1 

[198] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211-244,
2001. ISSN 1533-7928. 3.4, 3.4.2

[199]  S.  Tong  and  D.  Koller.  Support  vector  machine  active  learning  with  applications  to  text  classification.  Journal 
of  Machine  Learning  Research,  4:45-66,  2001.  1 , 5.2 

[200]  A.  Tsybakov.  Optimal  aggregation  of  classifiers  in  statistical  learning.  Annals  of  Statistics,  2004.  2,  5.2.2 

[201] L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984. (document), 1, 2.1, 3.3,
4.1.1, 4.8.3

[202] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their
probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971. 5.1.2, 5.1.3

[203] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc., 1998. (document), 1, 1.1.2, 1.1.3,
1.1.4, 2.1, 3.1, 4.1.1, 4.1.2, 4.2, 4.2, 4.4, 4.9, 4.10, 5.1.3, 7.1, 7.3.3, 7.5

[204]  S.  Vempala.  A  random  sampling  based  algorithm  for  learning  the  intersection  of  half-spaces.  In  Proceedings 
of  the  38th  Symposium  on  Foundations  of  Computer  Science,  pages  508-513,  1997.  2.1 

[205] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. J. Comp. Sys. Sci., 68(2):
841-860, 2004. 1.1.3, 4.1, 4.1.1

[206]  K.  A.  Verbeurgt.  Learning  DNF  under  the  uniform  distribution  in  quasi-polynomial  time.  In  COLT,  pages 
314-326,  1990.  2.1 

[207]  P.  Vincent  and  Y.  Bengio.  Kernel  matching  pursuit.  Machine  Learning,  48(1-3):165-187,  2002.  3.4.2 

[208] L. Wang, C. Yang, and J. Feng. On learning with dissimilarity functions. In Proceedings of the 24th
international conference on Machine learning, pages 991-998, 2007. 3.1, 3.6

[209]  M.  K.  Warmuth  and  S.  V.  N.  Vishwanathan.  Leaving  the  span.  In  Proceedings  of  the  Annual  Conference  on 
Learning  Theory,  2005.  3.4 

[210]  D.  Yarowsky.  Unsupervised  word  sense  disambiguation  rivaling  supervised  methods.  In  Meeting  of  the 
Association  for  Computational  Linguistics,  pages  189-196,  1995.  2.1, 2.2 

[211]  T.  Zhang.  Covering  number  bounds  of  certain  regularized  linear  function  classes.  J.  Mach.  Learn.  Res.,  2: 
527-550,  2002.  ISSN  1533-7928.  3.4.2,  3.4.2 

[212]  T.  Zhang.  Regularized  winnow  methods.  In  NIPS,  2001.  6 

[213] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In
NIPS, 2004. 2.5.1

[214] X. Zhu. Semi-Supervised Learning Literature Survey. 2006. Computer Sciences TR 1530, University of
Wisconsin - Madison. 2.1

[215] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic
functions. In Proc. ICML, pages 912-919, 2003. 1.1.1, 2.1, 2.2, 2.5.1

[216]  X.  Zhu,  Z.  Ghahramani,  and  J.  Lafferty.  Semi-supervised  learning:  From  gaussian  fields  to  gaussian 
processes.  Technical  report,  Carnegie  Mellon  University,  2003.  2.5.1 




Appendix  A 

Additional  Proof  and  Known  Results 


A.1  Appendix for Chapter 2

A.1.1  Standard  Results 


We state below a few known generalization bounds and concentration results used in our proofs. We start
with a classic result from [103].


Theorem A.1.1 Suppose that $C$ is a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. Let $D$ be
an arbitrary, but fixed probability distribution over $X \times \{-1, 1\}$. For any $\epsilon, \delta > 0$, if we draw a sample from $D$ of
size $m(\epsilon, \delta, D)$, then with probability at least $1 - \delta$ we have $|\mathrm{err}(h) - \widehat{\mathrm{err}}(h)| \le \epsilon$ for all $h \in C$.


We now present another classic result from [103].

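Theorem A.1.2 Suppose that $C$ is a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. Let $D$ be
an arbitrary, but fixed probability distribution over $X \times \{-1, 1\}$. Then

\[
\Pr\Big[ \sup_{f \in C,\ \widehat{\mathrm{err}}(f) = 0} |\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| > \epsilon \Big] \le \ldots
\]

So, for any $\epsilon, \delta > 0$, if we draw a sample from $D$ of size

\[
m \ge \ldots \Big( 2 \ln (C[2m, D]) + \ln (\ldots) \Big),
\]

then with probability at least $1 - \delta$, all functions with $\widehat{\mathrm{err}}(f) = 0$ satisfy $\mathrm{err}(f) \le \epsilon$.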

We now present a further classic result from [103].

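Theorem A.1.3 Suppose that $C$ is a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. Let $D$ be
an arbitrary, but fixed probability distribution over $X \times \{-1, 1\}$. Then

\[
\Pr\Big[ \sup_{f \in C} |\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| > \epsilon \Big] \le \ldots
\]

So, for any $\epsilon, \delta > 0$, if we draw from $D$ a sample satisfying

\[
m \ge \ldots \Big( \ln (C[m, D]) + \ln \ldots \Big),
\]

then with probability at least $1 - \delta$ all functions $f$ satisfy $|\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| \le \epsilon$.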




We now state a result from [71].

Theorem A.1.4 Suppose that $C$ is a set of functions from $X$ to $\{-1, 1\}$. Let $D$ be an arbitrary, but fixed probability
distribution over $X \times \{-1, 1\}$. For any target $f \in C$ and any i.i.d. sample $S$ of size $m$ from $D$, let $f_m$ be
the function that minimizes the empirical error over $S$. Then for any $\delta > 0$, the probability that

\[
\mathrm{err}(f_m) \le \widehat{\mathrm{err}}(f_m) + \sqrt{\frac{6 \ln C[S]}{m}} + \ldots
\]

is greater than $1 - \delta$.

Note that the above statement remains true if, on the right-hand side, we use $C[S']$ instead of $C[S]$, where $S'$
is another i.i.d. sample of size $m$ drawn from $D$.

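Theorem A.1.5 For any class of functions we have:

\[
\Pr\big[\log_2(C[S]) \ge \mathbf{E}[\log_2(C[S])] + \alpha\big] \le \exp\left[-\frac{\alpha^2}{2\mathbf{E}[\log_2(C[S])] + 2\alpha/3}\right]. \tag{A.1.1}
\]

Also,

\[
\mathbf{E}[\log_2 C[S]] \le \log_2 \mathbf{E}[C[S]] \le \frac{1}{\ln 2}\,\mathbf{E}[\log_2 C[S]]. \tag{A.1.2}
\]

A.1.2  Additional Proofs

Theorem A.1.6 For any class of functions we have:

\[
\Pr\big[\log_2(C[S]) \ge 2 \log \mathbf{E}[C[S]] + \alpha\big] \le e^{-\alpha/2}.
\]

Proof: Inequality (A.1.1) implies that:

\[
\Pr\big[\log_2(C[S]) \ge 2\mathbf{E}[\log_2(C[S])] + \alpha\big] \le \exp\left[-\frac{(\alpha + \mathbf{E}[\log_2(C[S])])^2}{2\mathbf{E}[\log_2(C[S])] + 2(\mathbf{E}[\log_2(C[S])] + \alpha)/3}\right].
\]

Since $\frac{(a + \alpha)^2}{2a + 2(a + \alpha)/3} \ge \frac{\alpha}{2}$ for any $a \ge 0$, we get

\[
\Pr\big[\log_2(C[S]) \ge 2\mathbf{E}[\log_2(C[S])] + \alpha\big] \le e^{-\alpha/2}.
\]

Combining this with the following fact (implied by Inequality (A.1.2)),

\[
\Pr\big[\log_2(C[S]) \ge 2 \log \mathbf{E}[C[S]] + \alpha\big] \le \Pr\big[\log_2(C[S]) \ge 2\mathbf{E}[\log_2(C[S])] + \alpha\big], \tag{A.1.3}
\]

we get the desired result. ■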
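The algebraic step in the proof above can be checked by hand. The following short derivation is our own verification, not part of the cited results; it reads the step as the claim that the exponent is at least $\alpha/2$, and assumes $a = \mathbf{E}[\log_2(C[S])] \ge 0$ and $\alpha > 0$.

```latex
\[
\frac{(a+\alpha)^2}{2a + 2(a+\alpha)/3} = \frac{3(a+\alpha)^2}{8a + 2\alpha},
\qquad
6(a+\alpha)^2 - \alpha(8a + 2\alpha) = 6a^2 + 4a\alpha + 4\alpha^2 \ge 0 .
\]
% Dividing the second relation by the positive quantity 2(8a + 2\alpha) gives
\[
\frac{3(a+\alpha)^2}{8a + 2\alpha} \ge \frac{\alpha}{2}.
\]
```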


A.2  Appendix  for  Chapter  5 

Theorem A.2.1 Let $C$ be a set of functions from $X$ to $\{-1, 1\}$ with finite VC-dimension $D \ge 1$. Let $P$ be an
arbitrary, but fixed probability distribution over $X \times \{-1, 1\}$. For any $\epsilon, \delta > 0$, if we draw a sample from $P$ of size
$N(\epsilon, \delta) = \frac{1}{\epsilon}\left(4D \log\left(\frac{1}{\epsilon}\right) + 2 \log\left(\frac{2}{\delta}\right)\right)$, then with probability $1 - \delta$, all hypotheses with error $\ge \epsilon$ are inconsistent
with the data.




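A.2.1  Probability estimation in high dimensional ball

Consider $x = [x_1, \ldots, x_d] \sim P_x$ uniformly distributed on the unit ball in $\mathbb{R}^d$. Let $A$ be an arbitrary set in $\mathbb{R}^2$; we are
interested in estimating the probability $\Pr_x((x_1, x_2) \in A)$. Let $V_d$ be the volume of the $d$-dimensional ball; we know

\[
V_d = \pi^{d/2} / \Gamma(1 + d/2),
\]

where $\Gamma$ is the Gamma function. In particular $V_{d-2}/V_d = d/(2\pi)$. It follows that

\begin{align*}
\Pr_x((x_1, x_2) \in A) &= \frac{V_{d-2}}{V_d} \int_{(x_1,x_2) \in A} (1 - x_1^2 - x_2^2)^{(d-2)/2}\, dx_1 dx_2 \\
&= \frac{d}{2\pi} \int_{(x_1,x_2) \in A} (1 - x_1^2 - x_2^2)^{(d-2)/2}\, dx_1 dx_2
\le \frac{d}{2\pi} \int_{(x_1,x_2) \in A} e^{-(d-2)(x_1^2 + x_2^2)/2}\, dx_1 dx_2,
\end{align*}

where we use the inequality $(1 - z) \le e^{-z}$.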
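The volume identity used above is easy to sanity-check numerically. The sketch below is illustrative only; the function names `ball_volume` and `volume_ratio` are our own, and it simply evaluates $V_d = \pi^{d/2}/\Gamma(1 + d/2)$ and compares $V_{d-2}/V_d$ with $d/(2\pi)$.

```python
import math


def ball_volume(d: int) -> float:
    """Volume of the d-dimensional unit ball: pi^(d/2) / Gamma(1 + d/2)."""
    return math.pi ** (d / 2) / math.gamma(1 + d / 2)


def volume_ratio(d: int) -> float:
    """Ratio V_{d-2} / V_d, which the text states equals d / (2 * pi)."""
    return ball_volume(d - 2) / ball_volume(d)


# Print the computed ratio next to the closed form for a few dimensions.
for d in (2, 3, 10, 50):
    print(d, volume_ratio(d), d / (2 * math.pi))
```

The two printed columns should agree up to floating-point rounding, which follows from $\Gamma(1 + d/2) = (d/2)\,\Gamma(d/2)$.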

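Lemma A.2.2 Let $d \ge 2$ and let $x = [x_1, \ldots, x_d]$ be uniformly distributed in the $d$-dimensional unit ball. Given
$\gamma_1 \in [0, 1]$, $\gamma_2 \in [0, 1]$, we have

\[
\Pr_x((x_1, x_2) \in [0, \gamma_1] \times [\gamma_2, 1]) \le \frac{\gamma_1 \sqrt{d}}{2\sqrt{\pi}}\, e^{-(d-2)\gamma_2^2/2}.
\]

Proof: Let $A = [0, \gamma_1] \times [\gamma_2, 1]$. We have

\begin{align*}
\Pr_x((x_1, x_2) \in A) &\le \frac{d}{2\pi} \int_{(x_1,x_2) \in A} e^{-(d-2)(x_1^2 + x_2^2)/2}\, dx_1 dx_2
\le \frac{\gamma_1 d}{2\pi} \int_{x_2 \in [\gamma_2, 1]} e^{-(d-2)x_2^2/2}\, dx_2 \\
&\le \frac{\gamma_1 d}{2\pi}\, e^{-(d-2)\gamma_2^2/2} \int_{x \in [0, 1-\gamma_2)} e^{-(d-2)x^2/2}\, dx
\le \frac{\gamma_1 d}{2\pi}\, e^{-(d-2)\gamma_2^2/2} \min\left[1 - \gamma_2, \sqrt{\frac{\pi}{2(d-2)}}\right].
\end{align*}

Note that when $d \ge 2$, $\min(1, \sqrt{\pi/(2(d-2))}) \le \sqrt{\pi/d}$. ■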
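Two steps of this proof are stated without justification. The following is our own filling-in of them, written for $d > 2$ and with $x_2 = \gamma_2 + x$, $x \ge 0$; it is a sketch of one way to see the inequalities, not necessarily the argument the author had in mind.

```latex
% Shifting the integration variable: since \gamma_2, x \ge 0,
\[
(\gamma_2 + x)^2 \ge \gamma_2^2 + x^2
\quad\Longrightarrow\quad
e^{-(d-2)(\gamma_2 + x)^2/2} \le e^{-(d-2)\gamma_2^2/2}\, e^{-(d-2)x^2/2}.
\]
% Bounding the remaining integral in two ways: the integrand is at most 1,
% and the full Gaussian integral over [0, \infty) is known in closed form.
\[
\int_0^{1-\gamma_2} e^{-(d-2)x^2/2}\, dx \le 1 - \gamma_2,
\qquad
\int_0^{1-\gamma_2} e^{-(d-2)x^2/2}\, dx \le \int_0^{\infty} e^{-(d-2)x^2/2}\, dx = \sqrt{\frac{\pi}{2(d-2)}}.
\]
```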

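Lemma A.2.3 Assume $x = [x_1, \ldots, x_d]$ is uniformly distributed in the $d$-dimensional unit ball. Given $\gamma_1 \in [0, 1]$,
we have

\[
\Pr_x(x_1 \ge \gamma_1) \le \frac{1}{2}\, e^{-d\gamma_1^2/2}.
\]

Proof: Let $A = [\gamma_1, 1] \times [-1, 1]$. Using a polar coordinate transform, we have:

\begin{align*}
\Pr_x((x_1, x_2) \in A) &= \frac{d}{2\pi} \int_{(x_1,x_2) \in A} (1 - x_1^2 - x_2^2)^{\frac{d-2}{2}}\, dx_1 dx_2 \\
&= \frac{d}{2\pi} \int_{(r,\, r\cos\theta) \in [0,1] \times [\gamma_1,1]} (1 - r^2)^{\frac{d-2}{2}}\, r\, dr\, d\theta \\
&= -\frac{1}{2\pi} \int_{(r,\, r\cos\theta) \in [0,1] \times [\gamma_1,1]} d\theta\, d(1 - r^2)^{\frac{d}{2}} \\
&\le -\frac{1}{2\pi} \int_{(r,\theta) \in [\gamma_1,1] \times [-\pi/2,\pi/2]} d\theta\, d(1 - r^2)^{\frac{d}{2}} = 0.5\,(1 - \gamma_1^2)^{\frac{d}{2}}.
\end{align*}

Using the inequality $(1 - z) \le e^{-z}$, we obtain the desired bound. ■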
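As a numerical illustration of the tail bound $\Pr(x_1 \ge \gamma_1) \le \frac{1}{2} e^{-d\gamma_1^2/2}$, the sketch below computes the exact tail probability by one-dimensional numerical integration and compares it with the bound. It relies on a standard fact not stated in the text: the first coordinate of a uniform point in the $d$-dimensional unit ball has density proportional to $(1 - t^2)^{(d-1)/2}$ on $[-1, 1]$. The function names and the midpoint-rule step count are our own choices.

```python
import math


def tail_probability(d: int, gamma: float, steps: int = 20000) -> float:
    """Pr(x_1 >= gamma) for x uniform in the d-dimensional unit ball.

    Uses the marginal density of x_1, proportional to (1 - t^2)^((d-1)/2),
    and approximates both integrals with the midpoint rule.
    """
    exponent = (d - 1) / 2

    def integrate(lo: float, hi: float) -> float:
        h = (hi - lo) / steps
        return h * sum((1.0 - (lo + (k + 0.5) * h) ** 2) ** exponent
                       for k in range(steps))

    return integrate(gamma, 1.0) / integrate(-1.0, 1.0)


def lemma_bound(d: int, gamma: float) -> float:
    """Right-hand side of the lemma: 0.5 * exp(-d * gamma^2 / 2)."""
    return 0.5 * math.exp(-d * gamma ** 2 / 2)


# Compare the exact tail with the bound in a few dimensions.
for d in (2, 10, 30):
    print(d, tail_probability(d, 0.3), lemma_bound(d, 0.3))
```

In each printed row the first value should not exceed the second; the gap shows how loose the bound is at that dimension.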

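Lemma A.2.4 Let $d \ge 4$ and let $x = [x_1, \ldots, x_d]$ be uniformly distributed in the $d$-dimensional unit ball. Given
$\gamma, \beta > 0$, we have:

\[
\Pr_x(x_1 \le 0,\ x_1 + \beta x_2 \ge \gamma) \le \frac{\beta}{2}\left(1 + \sqrt{-\ln \min(1, \beta)}\right) e^{-d\gamma^2/(4\beta^2)}.
\]

Proof: Let $\alpha = \beta \sqrt{-2d^{-1} \ln \min(1, \beta)}$; we have

\begin{align*}
\Pr_x(x_1 \le 0,\ x_1 + \beta x_2 \ge \gamma)
&\le \Pr_x(x_1 \le -\alpha,\ x_1 + \beta x_2 \ge \gamma) + \Pr_x(x_1 \in [-\alpha, 0],\ x_1 + \beta x_2 \ge \gamma) \\
&\le \Pr_x(x_1 \le -\alpha,\ x_2 \ge (\alpha + \gamma)/\beta) + \Pr_x(x_1 \in [-\alpha, 0],\ x_2 \ge \gamma/\beta) \\
&\le \frac{1}{2} \Pr_x(x_2 \ge (\alpha + \gamma)/\beta) + \Pr_x(x_1 \in [0, \alpha],\ x_2 \ge \gamma/\beta) \\
&\le \frac{1}{4}\, e^{-d(\alpha+\gamma)^2/(2\beta^2)} + \frac{\alpha\sqrt{d}}{2\sqrt{\pi}}\, e^{-d\gamma^2/(4\beta^2)} \\
&\le \left(\frac{1}{4}\, e^{-d\alpha^2/(2\beta^2)} + \frac{\alpha\sqrt{d}}{2\sqrt{\pi}}\right) e^{-d\gamma^2/(4\beta^2)} \\
&= \left(\frac{\min(1, \beta)}{4} + \frac{\beta\sqrt{-2 \ln \min(1, \beta)}}{2\sqrt{\pi}}\right) e^{-d\gamma^2/(4\beta^2)}.
\end{align*}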


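Lemma A.2.5 Let $u$ and $w$ be two unit vectors in $\mathbb{R}^d$ and assume that $\theta(u, w) \le \beta < \pi/2$. Let $d \ge 4$ and let
$x = [x_1, \ldots, x_d]$ be uniformly distributed in the $d$-dimensional unit ball. Consider $C > 0$ arbitrary, let

\[
\gamma = \frac{2 \sin\beta}{\sqrt{d}} \sqrt{\ln C + \ln\left(1 + \sqrt{\ln \max(1, \cos\beta / \sin\beta)}\right)}.
\]

Then

\[
\Pr_x\big[(u \cdot x)(w \cdot x) < 0,\ |w \cdot x| \ge \gamma\big] \le \frac{\sin\beta}{C \cos\beta}.
\]

Proof: We rewrite the desired probability as

\[
2 \Pr_x\big[w \cdot x \ge \gamma,\ u \cdot x < 0\big].
\]

W.l.g., let $u = (1, 0, \ldots, 0)$ and $w = (\cos(\theta), \sin(\theta), 0, \ldots, 0)$. For $x = [x_1, x_2, \ldots, x_d]$ we have $u \cdot x = x_1$ and
$w \cdot x = \cos(\theta) x_1 + \sin(\theta) x_2$. Using this representation and Lemma A.2.4, we obtain

\begin{align*}
\Pr_x\big[w \cdot x \ge \gamma,\ u \cdot x < 0\big] &= \Pr_x\big[\cos(\theta) x_1 + \sin(\theta) x_2 \ge \gamma,\ x_1 < 0\big] \\
&\le \Pr_x\left[x_1 + \frac{\sin(\beta)}{\cos(\beta)}\, x_2 \ge \frac{\gamma}{\cos(\beta)},\ x_1 < 0\right] \\
&\le \frac{\sin\beta}{2\cos\beta}\left(1 + \sqrt{\ln \max\left(1, \frac{\cos\beta}{\sin\beta}\right)}\right) e^{-\frac{d\gamma^2}{4\sin^2\beta}}
= \frac{\sin\beta}{2\cos\beta}\, C^{-1},
\end{align*}

as desired.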

A.3  Appendix  for  Chapter  7 

A.3.1  Concentration  Inequalities 

Here is the McDiarmid inequality (see [103]) we use in our proofs:

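Theorem A.3.1 Let $Y_1, \ldots, Y_n$ be independent random variables taking values in some set $A$, and assume that
$t : A^n \to \mathbb{R}$ satisfies:

\[
\sup_{y_1, \ldots, y_n \in A,\ \bar{y}_i \in A} \big|t(y_1, \ldots, y_n) - t(y_1, \ldots, y_{i-1}, \bar{y}_i, y_{i+1}, \ldots, y_n)\big| \le c_i,
\]

for all $i$, $1 \le i \le n$. Then for all $\gamma > 0$ we have:

\[
\Pr\big\{\big|t(Y_1, \ldots, Y_n) - \mathbf{E}[t(Y_1, \ldots, Y_n)]\big| \ge \gamma\big\} \le 2 e^{-2\gamma^2 / \sum_{i=1}^{n} c_i^2}.
\]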

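Here is also a consequence of the Chernoff bound that we used in Lemma 7.4.4.

Theorem A.3.2 Let $X_1, \ldots, X_n$ be independent Poisson trials such that, for $1 \le i \le n$, $\Pr[X_i = 1] = \frac{1}{2}$, and let
$X = \sum_{i=1}^{n} X_i$. Then for any $n'$ we have:

\[
\Pr\left(\left|X - \frac{1}{2}\, n\right| \ge \epsilon \max\{n, n'\}\right) \le 2 e^{-2 n' \epsilon^2}.
\]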




